Exploiting epigenomic and sequence-based features for predicting enhancer-promoter interactions

Discriminating the distal regulatory elements that target a given gene is a challenge in understanding gene regulation and illustrating the causes of complex diseases. Among known distal regulatory elements, enhancers interact with a target gene's promoter to regulate its expression. Although many machine learning approaches can predict enhancer-promoter interactions (EPIs), global and precise prediction of EPIs at the genomic level still requires further exploration. In this paper, we develop an integrated EPI prediction method with improved performance, called EpPredictor. Using various features of histone modifications, transcription factor binding sites, and DNA sequences in the human genome, we apply a robust supervised machine learning algorithm, LightGBM, to predict EPIs. Across six different cell lines, our method effectively predicts EPIs and achieves better F1-score and AUC than other methods, such as TargetFinder and PEP.


Introduction
Enhancers are key cis-elements that regulate spatiotemporal gene expression by contacting their target genes. The existence of enhancers dramatically increases the complexity of regulatory networks in human and other organisms 1. Thousands of putative enhancers, which outnumber coding genes, have been mapped in mammalian genomes across different cell types 2,3. In many cases, one cognate gene can be controlled by multiple enhancers; in turn, one enhancer can also interact with more than one target gene 4. Together, these create a complicated and nonlinear regulatory network. Complexity increases further because enhancers are often located at a tremendous genomic distance from their cognate genes in mammalian and other vertebrate genomes.
To better understand EPIs, many studies have attempted to predict them. Some algorithms use epigenomic features, such as TargetFinder 5 and EpiTensor. By reconstructing regulatory landscapes from different features and integrating hundreds of genomics data sets, TargetFinder can accurately predict individual enhancer-promoter interactions using features from the enhancer, the promoter, and the window region between the promoter and the enhancer. TargetFinder revealed that the window region is more informative than the promoter and enhancer regions. EpiTensor 13 proposes a novel unsupervised computational method to derive 3D interactions between distal genomic loci from 1D epigenomic data. Other methods use only sequence as input data and can still achieve high performance in predicting EPIs. SPEID is the first deep learning framework that uses only sequence features to predict enhancer-promoter interactions 6. By integrating four features derived from sequence, gene expression, and epigenomic data, IM-PET 8 predicts EPIs with a random forest classifier. PEP is an algorithm based on a boosted tree ensemble model to predict long-range EPIs. It consists of two modules (i.e., PEP-motif and PEP-word), which use different feature extraction methods. These studies also provide insight into how epigenetic features and sequences correlate with EPIs. Unlike all the above methods, we develop a LightGBM-based algorithm, named EpPredictor, to predict enhancer-promoter interactions. Our results show that epigenomic features combined with sequence-based features can reliably predict enhancer-promoter interactions and achieve better F1-score and AUC than TargetFinder and PEP.

Datasets
In this paper, we use only the enhancer-promoter pair dataset from TargetFinder 5, which includes positive and negative enhancer-promoter sample pairs in six cell lines (GM12878, K562, IMR90, HeLa-S3, HUVEC, and NHEK). The distance between the enhancer and promoter in each pair is between 10 KB and 20 MB. The histone modification and transcription factor ChIP-seq datasets for the six cell lines are from the ENCODE Project. The histone and transcription factor datasets of each cell line contain peak and signal value features in their regions.
By analyzing the datasets of the TargetFinder and PEP methods, we find that the numbers of positive and negative samples differ greatly: there are 20 times as many negative samples as positive samples. Such an imbalance leads to biased predictions and can result in overfitting. Therefore, we use all positive samples from TargetFinder and randomly draw an equal number of negative samples from its negative set. The resulting ratio between positive and negative samples is 1:1, which resolves the sample imbalance.
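The balancing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `balance_pairs`, the fixed random seed, and the toy sample tuples are all assumptions made for the example.

```python
import random


def balance_pairs(positives, negatives, seed=0):
    """Keep every positive pair and draw an equally sized random subset of negatives."""
    if len(negatives) < len(positives):
        raise ValueError("not enough negative samples to match the positives")
    rng = random.Random(seed)  # fixed seed (assumed) so the draw is reproducible
    sampled_negatives = rng.sample(negatives, k=len(positives))  # without replacement
    return positives + sampled_negatives


# Hypothetical toy data with the 1:20 imbalance mentioned in the text.
positives = [("pos", i) for i in range(5)]
negatives = [("neg", i) for i in range(100)]

balanced = balance_pairs(positives, negatives)
n_pos = sum(1 for label, _ in balanced if label == "pos")
n_neg = sum(1 for label, _ in balanced if label == "neg")
print(n_pos, n_neg)
```

Sampling without replacement keeps each negative pair at most once, so the balanced set contains no duplicated examples.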

Determine the region of feature extraction
It is unclear which regions of the enhancer-promoter pairs within a chromosome are useful for predicting EPIs. Our dataset of enhancer-promoter pairs is from TargetFinder, which provides the start and end points of each enhancer and of each promoter. Based on these coordinates, we define the regions for feature extraction, as shown in Fig. 1. Note that the distance between the start point and the end point of a promoter is mostly within 2k; to obtain more useful information, we extend 2k to the left and to the right of the promoter start point. After defining the feature extraction regions, we map their locations onto the chromosome and protein files to obtain the related sequence and protein information.
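As a small illustration of the promoter extension, the sketch below computes the extended interval around a promoter start point. It assumes that "2k" means 2,000 bp and that coordinates are clamped at zero; the function name `promoter_region` and the example coordinates are hypothetical.

```python
def promoter_region(promoter_start, flank=2000):
    """Return (start, end) covering `flank` bp on each side of the promoter start point."""
    start = max(0, promoter_start - flank)  # do not run past the chromosome start
    end = promoter_start + flank
    return start, end


# Hypothetical promoter start coordinates.
print(promoter_region(10000))
print(promoter_region(500))
```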

Epigenomic features extraction
Features are essential to the model. A good feature set represents the information contained in the data well, both implicit and explicit. A machine learning model trained on good features can therefore achieve better accuracy, generalization, and robustness. We extract multiple sets of features across the six cell lines. By analyzing the distribution of enhancers and promoters on chromosomes in different regions at different signal values and peaks, we derive multiple groups of features based on protein signal value and peak. Each of the four regions (i.e., ER, WR, FR, and PR) contains information about multiple sites of a protein, and therefore yields multiple sets of signal value and peak data.
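One way such signal value and peak features could be summarized for a single protein in a single region is sketched below. The choice of summaries (peak count and mean signal), the half-open interval convention, and the name `region_features` are assumptions made for illustration; the text does not specify the exact aggregation used.

```python
def region_features(peaks, region_start, region_end):
    """Summarize the peaks of one protein that overlap [region_start, region_end).

    peaks: list of (start, end, signal_value) tuples.
    Returns the number of overlapping peaks and their mean signal value.
    """
    signals = [
        signal
        for start, end, signal in peaks
        if start < region_end and end > region_start  # intervals overlap
    ]
    count = len(signals)
    mean_signal = sum(signals) / count if count else 0.0
    return {"peak_count": count, "mean_signal": mean_signal}


# Hypothetical ChIP-seq peaks for one protein.
peaks = [(100, 200, 4.0), (150, 400, 6.0), (900, 1000, 10.0)]
features = region_features(peaks, 120, 500)
print(features)
```

Repeating this for every protein and every region would give one group of features per protein-region combination.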

Sequence feature extraction
In addition to the epigenomic features, we also extract sequence features from the four regions (ER, WR, FR, and PR). Our sequence feature extraction method is the weight matrix, which is a motif descriptor. This method attempts to capture the inherent variability of a sequence pattern. It is usually composed of an equal number of sequences associated with a set of functional genes. We find that cutting each sequence into segments of length L=20 gives the best EPI prediction.
For example, given the sequences of 3 eukaryotes, we tabulate the frequencies observed for each nucleotide at each position to calculate the weight matrix. Each number in this matrix represents the number of times that a given nucleotide has been observed at a given position. For example, nucleotide "A" has been observed in 2 aligned sequences at position 1 and is therefore represented as 2 in the matrix.
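The tabulation step can be sketched in a few lines. The three aligned sequences below are hypothetical and much shorter than L=20; they are chosen only so that "A" appears twice at position 1, as in the example above. The function name `count_matrix` is likewise an assumption.

```python
def count_matrix(sequences, alphabet="ACGT"):
    """Count how often each nucleotide occurs at each position of aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = {nucleotide: [0] * length for nucleotide in alphabet}
    for seq in sequences:
        for position, nucleotide in enumerate(seq.upper()):
            if nucleotide in matrix:  # skip ambiguous symbols such as N
                matrix[nucleotide][position] += 1
    return matrix


# Hypothetical aligned sequences (not taken from the paper).
sequences = ["ACGT", "AGGT", "TCGA"]
matrix = count_matrix(sequences)
for nucleotide in "ACGT":
    print(nucleotide, matrix[nucleotide])
```

Each column of the matrix sums to the number of sequences, so dividing by that number would turn the counts into per-position frequencies.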

Choose LightGBM as well as determine the region and feature selection
LightGBM (i.e., LGB) is an algorithm with high efficiency, low memory usage, and fast training speed. We carry out extensive experiments comparing LightGBM with a support vector machine (SVC) 10 and Adaboost 11 using the epigenomic and sequence features. Fig. 2 shows that LightGBM achieves the highest F1-score compared to SVC and Adaboost in all six cell lines (i.e., K562, GM12878, IMR90, HeLa-S3, HUVEC, and NHEK). Different features and their related feature extraction regions should be carefully considered. We therefore evaluate LightGBM 12 with different features (i.e., DNA sequence features, epigenomic features, and epigenomic plus sequence features) on the four regions (i.e., ER, PR, WR, and FR), as shown in Table 4.

Fig. 2. Selecting the optimal algorithm. F1-scores of the LightGBM, Adaboost, and SVC algorithms using epigenomic plus sequence features in six cell lines.

Compare EpPredictor to TargetFinder and PEP
We use features from the four regions to evaluate the performance of EpPredictor across six cell lines compared to TargetFinder and PEP-integrate (Fig. 3) in terms of F1-score, AUC, and MCC. The results show that EpPredictor achieves a better F1-score than the other two methods, TargetFinder (on ER/PR/WR) and PEP (on ER/PR/WR). As detailed in Table 3, the average F1-score achieved by EpPredictor across six cell lines is 0.88, 5% higher than TargetFinder and PEP (on ER/PR/WR) (0.83 and 0.83). The F1-score improves significantly in all cell lines. In the K562 cell line, the F1-scores of EpPredictor, TargetFinder (on ER/PR/WR), and PEP (on ER/PR/WR) are 0.91, 0.85, and 0.82, respectively, an increase of 6%-9% for EpPredictor. In the GM12878 cell line, the F1-score of EpPredictor is 0.85, which is 4% higher than TargetFinder. The largest improvement is observed in HUVEC, where the F1-score of EpPredictor is 10%-12% higher than TargetFinder and PEP. As for the AUC, EpPredictor outperforms TargetFinder in all six cell lines by about 4%-10%, and performs similarly to PEP-integrate in three of the six cell lines (HeLa-S3, HUVEC, and NHEK; 0.96, 0.94, and 0.98, respectively).
The results in Table 4 indicate that LightGBM achieves better performance when both the epigenomic and sequence features are selected, while the performance is not as expected when only DNA sequence features are selected. From Table 4, we observe that in five cell lines (i.e., GM12878, K562, HeLa-S3, HUVEC, and NHEK), the full region gives the best result. For instance, in GM12878, LightGBM reaches 0.8616, which is the best in this column. The only exception is IMR90, where it reaches 0.8371, slightly lower than the 0.8418 obtained under the window region. The full region (FR) is more informative than the other three regions when incorporating epigenomic and sequence features (Table 2).
Interestingly, although EpPredictor achieves a higher F1-score and AUC, its MCC is not as expected. In particular, in IMR90, TargetFinder (on ER/PR/WR) and PEP (on ER/PR/WR) achieve MCC values of 0.81 and 0.84, better than EpPredictor (0.69). A classifier may have a higher F1-score and a lower MCC than another, which means that a single score cannot capture all of a classifier's advantages and disadvantages. In summary, EpPredictor has a clear advantage over TargetFinder and PEP in F1-score, and this trend also holds for the AUC.
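To see how a higher F1-score can coincide with a lower MCC, the sketch below computes both metrics from confusion-matrix counts using their standard definitions. The two sets of counts are hypothetical and are not taken from the paper's results; they describe two classifiers evaluated on a balanced set of 100 positive and 100 negative pairs.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); true negatives do not enter the formula."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient, which uses all four counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical classifier A: high recall, many false positives.
f1_a, mcc_a = f1_score(90, 30, 10), mcc(90, 30, 10, 70)
# Hypothetical classifier B: lower recall, very few false positives.
f1_b, mcc_b = f1_score(70, 2, 30), mcc(70, 2, 30, 98)

print(f"A: F1={f1_a:.3f} MCC={mcc_a:.3f}")
print(f"B: F1={f1_b:.3f} MCC={mcc_b:.3f}")
```

In this toy case classifier A has the higher F1-score while classifier B has the higher MCC, because F1 ignores true negatives and MCC penalizes A's many false positives.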

Discussion and Conclusion
We developed EpPredictor, a LightGBM-based method, to predict EP interactions. Compared to other models such as SVM and Adaboost, LightGBM achieves higher performance.
Many methods use only features from the promoter and enhancer regions to predict EPIs, while TargetFinder suggests that the window region's features are more critical for predicting EPIs [13,14,15]. Our method revealed that the full region with sequence and epigenetic features is more effective in predicting EPIs than the window region. To obtain better performance, EpPredictor integrates epigenomic features, which are extracted from the four regions (ER/PR/WR/FR) according to histone and transcription factor information, with sequence features, which are also extracted from the four regions (ER/PR/WR/FR).