CREaTor: zero-shot cis-regulatory pattern modeling with attention mechanisms

CREaTor predicts cell type-specific gene expression in unseen cell types

CREaTor consists of two transformer models at different resolutions (Fig. 1a and Additional file 1: Fig. S1). Transformer is a deep learning architecture that has been demonstrated as a powerful tool for natural language processing [30,31,32], computer vision [33, 34], and biological modeling [27, 35, 36]. A core component of a transformer is the self-attention module, which extracts sequence-level information by modeling the interactions between elements at different positions in the sequence [30]. In CREaTor, the lower-level transformer (element encoder) learns the latent representation for each cCRE from the DNA sequence and chromatin states of the element itself, while the upper-level transformer (regulation encoder) predicts gene expression from a collection of cCRE latent representations flanking the target gene. Self-attention extracted from the regulation encoder is used to interpret the cCRE-gene and cCRE-cCRE interactions.

Fig. 1figure 1

Accurate gene expression prediction with CREaTor. a Schema of CREaTor. The model predicts target gene expression from the flanking cCREs with a hierarchical transformer structure. Localization of cCREs was obtained from ENCODE consortium. A combination of genomic sequences, chromatin accessibility, and a collection (3–13 types) of ChIP-seq profiles was used as input features for each cCRE. b Visualization of data split strategy: we trained our model on gene expression of 19 autosomes from 19 different cell lines and tissues respectively. Genes on chr16 from the 19 cell lines and tissues were used for parameter tuning (validation), while genes on chr8, 9 were used for model evaluation (in-cell type test chromosomes). Genes from all autosomes in K562 (cross-cell type test chromosomes) were detailly evaluated to demonstrate the model’s ability on cross-cell type gene expression and regulation modeling (Additional file 1: Fig. S2). c Pearson r between observed and predicted expression of genes. Left: Pearson r between observed and predicted expressions of genes on cross-cell type test chromosomes. Right: Pearson r between observed and predicted expressions of genes on in-cell type test chromosomes. Green and blue dots indicate chr8 and 9 respectively. See Additional file 2: Table S3 for results with different random seeds. d Clustering map of predicted and observed expression of K562 specific genes (calculated with RSME; see the “Methods” section) in different cell types. The leftmost column is the predicted value, which is clustered with the K562 observed gene expression data using the hierarchical clustering method. Expression values were transformed with log1p. Observed gene expression profiles from different sources (with different experiment IDs on ENCODE) for the same cell type are calculated independently

We trained CREaTor on 19 human tissues and cell lines whose annotated cCRE information was available in SCREEN Registry [16] (Additional file 2: Table S1). In each cell type, chr16 was left out for validation, chr8 and 9 were left out for testing (in-cell type test chromosomes), and all other autosomes were used for training (Fig. 1b and Additional file 1: Fig. S2). By training the model with data from multiple cell types jointly, we expected the model to learn general rules guiding gene regulation across cell types. Next, we evaluated CREaTor’s performance on autosomes of the K562 cell line (cross-cell type test chromosomes), which were unseen by the model, to demonstrate the generalizability of our method (Fig. 1b and Additional file 1: Fig. S2). For the in-cell type test chromosomes, CREaTor reached a mean correlation of 0.850 and 0.818 (Pearson r) for chr8 and 9 respectively (Fig. 1c and Additional file 2: Table S2). While for the cross-cell type test chromosomes, the correlations between observed and predicted gene expression on different chromosomes ranged from 0.756 to 0.936, with a mean correlation of 0.902 (Pearson r) (Fig. 1c and Additional file 2: Table S3). Notably, the predictive accuracy of K562 chr8 and 9 (0.839 and 0.810 respectively, Pearson r) was comparable to that of in-cell type test chromosomes, suggesting that CREaTor can predict gene expression efficiently from cCREs in new cell types.

However, the performance gap between chr8/9 and other chromosomes in K562 is non-trivial. We reasoned that the presence of housekeeping genes and several hematopoietic cell types in training data alleviated the challenge for expression prediction on chromosomes other than 8 and 9. To assess the generalizability of CREaTor more rigorously, we next examined if CREaTor could make cell type-specific predictions. With gene differential expression (GDE) analysis on paired data between K562 and each of the 19 cell types used for model training respectively, we identified 410 genes that were differentially expressed in the K562 cell line (see the “Methods” section). For a number of these genes, including hematopoietic regulators KLF1 and TAL1 and hemoglobin subunit protein HBE1, CREaTor made a prediction on rival with experimental quantifications (Additional file 1: Fig. S3). In addition, single-linkage clustering analysis demonstrated that the prediction on K562 differentially expressed genes was more similar to observed K562 expression compared to other cell types (Pearson r = 0.68, Fig. 1d).

To further demonstrate that the accurate prediction of K562 expression is not attributed to the similarity between K562 and training cell types, we compared the predicted expressions with 122 observed expression profiles of 20 distinct cell types. These profiles included 12 profiles from 3 independent K562 RNA-seq experiments that our model had not previously encountered. We visualized all expression profiles with Uniform Manifold Approximation and Projection (UMAP) in a 2-dimensional space. For different chromosome subsets, predicted expression consistently exhibits high similarity to K562, as opposed to other cell types, including those sharing hematopoietic origins with K562 (Additional file 1: Fig. S4). Also, we conducted leave-one-chromosome-out and leave-one-cell type-out experiments to confirm that CREaTor’s superior performance was not limited to chr8-9 and K562, respectively (Additional file 1: Fig. S5 and Additional file 2: Table S4a-b).

It has been reported that histone modifications and DNA openness proximal to gene transcription start sites (TSS) are significantly correlated with active transcription [37]. To demonstrate that distal information contributes to model performance, we compared models trained with cCREs up to 2 kb, 5 kb, 10 kb, 100 kb, or 1 Mb away from the TSS of target genes. Performance improved with increasing candidate window sizes (Additional file 1: Fig. S6), suggesting that CREaTor predicts gene expression from both proximal and distal cCREs. Also, this result indicates that distal cCREs are substantial for accurate expression prediction, supporting the importance of long-range cis-regulatory interactions in gene regulation. But meanwhile, it is worth noting that the model trained with cCREs up to 2 kb away from target genes performed significantly better than random guesses, consistent with the knowledge that the proximal functional genomes and cCREs are closely related to gene expression.

Self-attention reveals functional cCREs in unseen cell types

Attention weights between cCREs and target genes extracted from CREaTor (see the “Methods” section) may be exploited to interpret the importance of each cCRE to genes. To test this hypothesis, we benchmarked CREaTor against 3 CRISPR-based experimental-validated K562 enhancer-target gene datasets [13,14,15]. To be noted, criteria for candidate enhancers vary in each study and few enhancer-target gene pairs tested were shared among studies (Additional file 1: Fig. S7 and Additional file 3: Table S5). Thus, we combined the experimental results and identified 1859 putative enhancers related to 328 genes that were tested by both the experimental approaches and CREaTor across the K562 genome. CREaTor prioritizes positive enhancer-gene pairs to negative ones with larger attention scores (auROC = 0.834, auPRC = 0.620; Fig. 2a, b), and the performance is further improved when we adjusted the attention scores with enhancer-gene genomic distances (auROC = 0.843, auPRC = 0.667; Fig. 2a, b). In addition, we compared the scores derived from the attention weights of CREaTor with a quantitative analysis of enhancer effects as described in a previous study [13]. In this study, the enhancer effect on gene expression was defined as the change in gene expression upon enhancer knockdown using CRISPR perturbation. Consequently, the quantitative effect is inversely related to the enhancer activity. In line with this understanding, we observed a negative correlation, with a Spearman ρ of  -0.269, between the CREaTor scores and the quantitative observations (Fig. 2c), implying that CREaTor captures quantitative effects of cCREs to genes.

Fig. 2figure 2

Attention matrix of CREaTor implies cCRE-gene interactions. a, b auROC (a) and auPRC (b) of CREaTor outperform its counterparts on cCRE-gene pair classification. Attention (attn., yellow): normalized attention weights (genes to cCREs) in CREaTor. Adjusted attention (adj. attn., red): attention scores/log10 (distance). H3K27ac/dist (blue): approximate of the ABC score. Distance quantifies relative genomic distances between genes and cCREs. H3K27ac value of a cCRE is calculated as the sum of H3K27ac peak values of the element. Labels (positive/negative) of cCRE-gene pairs were collected from 3 independent CRISPR perturbation experiments [13,14,15]. c Attention scores derived from attention weights are significantly correlated with the effect of enhancer on gene expression quantified by Fulco et al. [13]. As the quantification measures the change of target gene expression upon enhancer knock-down using CRISPR perturbation, the quantitative effect values are inversely related to enhancer activities. d, e auPRC (d) and auROC (e) of CREaTor and its counterparts on the classification of cCRE-gene pairs collected from a Pol-II mediated ChIA-PET experiment. The performance is evaluated for each gene and each distance group separately. Groups with < 10 samples were filtered out. Center line, median; box limits, upper and lower quartiles; whiskers, 1.5 × interquartile range; points, outliers. f MYC locus showing predicted and previously reported regulators in K562 cells. For CREaTor (red) and H3K27ac/distance (gray), peaks on the tracks represent the scores of different cCRE regions. Enhancers track (red squares) denotes reported active regulators of MYC. Representative DNase, H3K4me3, H3K27ac, and CTCF tracks, as well as ChIA-PET interactions in K562, are also annotated

We also compared CREaTor with 4 methods previously used for cCRE-gene interaction modeling: (1) predictions based solely on genomic distances between cCREs and genes; (2) predictions based on cCRE H3K27ac signals and cCRE-gene distances (approximate version of the Activity-by-Contact (ABC) score [13]). These model-free approaches can estimate activities of cCREs spanning varying ranges without prior knowledge of cis-regulatory programs in any cell types, or cell types with H3K27ac quantifications, which align well with the setting of CREaTor. Evaluated on a comprehensive set of metrics, CREaTor outperforms both methods at different distance groups (Fig. 2a, b and Additional file 1: Fig. S8). In addition, we compared CREaTor to 2 state-of-the-art deep learning approaches, (3) Enformer [27] and (4) GraphReg [28]. Both Enformer and GraphReg, trained with supervised gene expression prediction tasks, support zero-shot cCRE-gene interaction prediction. However, Enformer’s architecture limits it from long-range enhancer-gene interaction prediction, as the released Enformer model can only predict interactions up to 200 kb. Additionally, it cannot generalize to new cell types as it solely relies on genomic sequences for predictions. To simulate prediction tasks in new cell types, we adopted the cell-type-agnostic setting of Enformer (see the “Methods” section). As expected, predicting enhancer-gene interactions in new cell types with Enformer is not favorable (Fig. 2a, b and Additional file 1: Fig. S8). GraphReg, on the other hand, predicts CAGE signals from 1D epigenomic data and 3D genomic structures, allowing it to generalize to new cell types. However, its dependency on 3D genomic structures and CAGE profiles narrows its applicability. To evaluate GraphReg, we trained an enhanced GraphReg model using 9 cell types and 16 types of epigenomic profiles from scratch and derived feature importance to estimate enhancer activities in K562 as suggested by the original study (see the “Methods” section). Our results show that CREaTor greatly outperforms GraphReg (Fig. 2a, b and Additional file 1: Fig. S8), suggesting the superiority of CREaTor’s design.

Next, cCRE-gene interactions discovered by CREaTor were further benchmarked against a genome-wide Pol II-mediated ChIA-PET dataset [38]. Compared with CRISPR perturbation studies, ChIA-PET covers a broader range of genes and regulators, thus capturing more comprehensive interactions between genes and regulators. We recovered 6,132,740 cCRE-gene pairs (both positive and negative) across the K562 genome from ChIA-PET. To benchmark CREaTor and its counterparts, for each gene, we calculated auROC and auPRC of the corresponding cCRE-gene pairs stratified by their relative genomic distances. Among all, CREaTor shows the highest median auROC and auPRC for gene collections at all distance groups and greatly outperforms Enformer and GraphReg (Fig. 2d, e). Strikingly, CREaTor performs substantially better at groups spanning longer ranges.

Since ChIA-PET captures physical proximities between genomic regions, false positives exist when active CRE-gene pairs are recovered from ChIA-PET. To benchmark our method more comprehensively, we calculated the precision and specificity scores for different methods considering that these metrics are less impacted by false positives. Consistently, CREaTor outperforms other methods (Additional file 1: Fig. S9), indicating that CREaTor can capture cCRE-gene interactions efficiently from genomic features flanking target genes in unseen cell types.

Lastly, we examined if our model recovered regulators of the oncogenic gene MYC (chr8: 127,735,434–127,742,951). cCREs of MYC disperse along genomic sequences to as far as 2 Mb downstream MYC TSS and active MYC regulators in K562 were identified by previous studies with various approaches [39,40,41]. Therefore, we examined if CREaTor could pinpoint these regulators accurately. The result indicates that CREaTor prioritizes positive MYC cCREs with larger attention scores and captures active cCREs missed by other predictive approaches (Fig. 2f). In addition, 2 groups of sharp peaks are observed 2 Mb downstream MYC TSS (Fig. 2f), in concordance with the existence of 2 distal super-enhancer regions of MYC. Since MYC is in both in- and cross-cell type test sets, we believe that CREaTor has learned general rules guiding cCRE-gene interactions in different cell types, rendering it an efficient tool for cCRE activity modeling in unseen cell types.

CREaTor captures chromatin domain boundaries in unseen cell types

Three-dimensional (3D) chromatin folding allows physical interactions between distal cCRE and genes and the information can also guide gene regulation modeling [13, 28, 42]. Without incorporating 3D chromatin folding information in our model, we were curious to see if CREaTor captured the topological structure of the genome, considering that CREaTor precisely recovers cCRE-gene interactions even of long ranges.

Attention matrices extracted from the model imply not only the interactions between cCREs and genes, but also relationships between cCRE-cCRE pairs. To examine if the attention matrix reflects contact frequency between elements, we aggregated the attention matrix at 10 kb resolution for each gene in K562 and compared the results to a high-resolution Hi-C study [10]. In addition to observing similar checkerboard patterns between the attention matrix and Hi-C (Fig. 3a), we systematically evaluated the consistency between the attention matrix and topologically associating domains (TADs) by analyzing insulation scores (see the “Methods” section). We calculated insulation scores from the attention matrix over 12,584 K562 TAD boundaries defined in a recent study [43]. The average score across the genome shows a clear insulation pattern on boundaries, similar to that calculated from the Hi-C experiment (Fig. 3b). Meanwhile, no significant decrease over GM12878-specific TAD boundaries [43] is observed with insulation scores calculated from either K562 attention matrix or K562 Hi-C (Fig. 3b), demonstrating that CREaTor captures K562-specific TAD boundaries. Together, the results show that CREaTor can infer cell type-specific topological structures of genomes in cells unseen by the model.

Fig. 3figure 3

CREaTor captures hierarchically higher-order genome organizations. a Example genomic regions showing the similarity between attention matrix (above the diagonal) and Hi-C contact matrix (below the diagonal). Orange boxes indicate TAD domains. b Average insulation scores across the K562 genome at 10-kb resolution calculated from attention matrix and Hi-C. Blue line and left y-axis: insulation scores of attention matrix. Pink line and right y-axis: the insulation scores of Hi-C. Solid lines indicate insulation scores over K562 TAD boundaries and dashed lines indicate insulation scores over GM12878 boundaries. The x-axis is centered on TAD boundaries. c Upper panel: Statistics of attention weights between CTCF-bound element pairs with different topological relationships. Center line, median; box limits, upper and lower quartiles; whiskers, 1.5 × interquartile range. Lower panel: illustration of CTCF-bound element pairs used for the analysis. The red triangle represents TAD domains called from the Hi-C matrix (blue). d Average attention scores between elements without normalization. p-value is calculated with Mann–Whitney U test

We reason that CREaTor infers genome structures by learning the insulating behaviors of CTCF-bound elements. Consistently, we found that paired CTCF-bound insulators flanking the same TAD domain showed significantly larger attention scores compared to either unmatched insulator pairs spanning multiple TADs, or pairs involving non-insulator CTCF-bound elements (Fig. 3c). Thus, CREaTor may predict gene expression and capture cCRE-gene regulation efficiently by modeling topological patterns of the genome.

CREaTor implies directional regulation between cCREs

It is long proposed that enhancers form hierarchical relationships with each other, yet the relationship is challenging to be disentangled with biological experiments. For example, Carleton et al. developed an enhancer interference technique (Enhancer-i) to study the combinational effects of distal regulatory regions on genes [44]. They showed the interdependence between CISH-1 and CISH-2, two estrogen receptor α-bound enhancers of the cytokine signaling suppressor gene CISH. However, the detailed mechanism between interactions of CISH-1 and CISH-2 could not be elucidated. Here, we examined attention scores between cCREs within CISH-1 and CISH-2 regions (denoted as Cr1 and Cr2 respectively) and found that the attention from Cr1 to Cr2 is significantly larger than the other way around (Fig. 3f). To rule out potential distance bias, we examined attention score distribution of 5773 genes whose cCRE-gene distances were similar to Cr1-CISH and Cr2-CISH (denoted as SDaCr1 and SdaCr2). Remarkably, no directional preference between SDaCr2 and SDaCr2 was observed (Fig. 3f). Therefore, our results indicate that there could be a directional relationship between CISH-1 and CISH-2, which is driven by hierarchical regulation of enhancers. We thus believe that with further development, CREaTor has the potential to become a powerful tool for understanding the causal relationships within enhancer networks.

cCRE representations learned by CREaTor suggest a new role of CTCF-bound elements

To investigate how CREaTor perceives cCREs and their roles during gene regulation, we clustered cCREs by the 256-dimensional cCRE representations extracted from CREaTor and examined features enriched in each group.

While different cCRE types are enriched in different clusters (Fig. 4a, b), cCRE representations learned by our model better capture functional variations of elements compared to the classification of ENCODE. For instance, while cCREs are aggregated into 6 clusters, both cluster 0 and cluster 1 are enriched with proximal enhancer-like elements, cCREs that show enhancer-like signatures falling within 200 bp of an annotated transcript start site (TSS) [16]. However, proximal enhancer-like elements in cluster 0 are enriched for RNA polymerase II (Pol II) signals (Additional file 1: Fig. S10), markers of active transcription events, compared to those in cluster 1. Since promoter-like elements are also enriched in cluster 0 and enhancers are believed to be able to contribute to promoter activities [45], we reason that CREaTor learns the discrepancies between enhancer-like elements of different roles and therefore associates a subgroup of proximal enhancer-like elements with promoters. Meanwhile, the fuzzy boundaries between clusters may indicate the adaptable functions of elements for gene regulation captured by our model.

Fig. 4figure 4

cCRE representations learned by CREaTor suggest a new role of CTCF-bound elements. a Uniform Manifold Approximation and Projection (UMAP) of cCRE embeddings in K562. Upper: colored and numbered as clusters grouped by the Leiden algorithm. Bottom: colored and labeled by element type. b Composition of different element types in each cluster by percentage. Proximal elements: elements falling within 200 bp of an annotated TSS. Distal elements: elements more than 200 bp from any annotated TSS. Promoter-like: elements with high DNase and H3K4me3 signals. Enhancer-like: elements with high DNase and H3K27ac signals. CTCF-only: elements with high DNase and CTCF signals, as well as low H3K4me3 and H3K27ac signals. c Fold change of histone marker peaks of given types of cCREs in cluster 5 with respect to those in other clusters. Top: all cCREs. Middle: distal enhancer-like elements. Bottom: CTCF-only elements. d Expression value (log1p) distribution of genes within 10 kb of different types of CTCF-bound elements. e Average signals of H3K36me3, H3K79me2, H4K20me1, H2AFZ, H3K4me1, and H3K27me3 on different types of CTCF-bound elements. f Illustration for the proposed model of CTCF-H3K36me3 elements promoting transcription elongation

CTCF-only cCREs, which lack both enhancer-like signatures and promoter-like signatures, are more isolated from other elements, consistent with their insulator and looping functions (Fig. 4a). However, CTCF-only cCREs are clustered into 2 separate groups, while a subgroup of CTCF-only cCREs is aggregated with distal enhancer-like elements in cluster 5 (Fig. 4b). Compared to other clusters, cluster 5 shows a significant enrichment of H3K36me3 peaks (Fig. 4c), a histone modification associated with diverse functions in conjugation with different types of epigenetic markers [42, 46,47,48,49,50], indicating a higher chromatin activity of these elements. Consistent with the result, genes close to CTCF-only cCREs in cluster 5 (denoted as CTCF-H3K36me3 elements) show higher expression values compared to those close to low H3K36me3 CTCF-only elements (Fig. 4d), suggesting a more active role in gene transcription of CTCF-H3K36me3 elements.

Depletion of repressive histone modification H3K27me3 also supports the greater activity of CTCF-H3K36me3 elements (Fig. 4e). Other from H3K36me3, CTCF-H3K36me3 elements are enriched with H3K79me2 and H4K20me1 (Fig. 4e), a pattern that has been previously reported to be associated with active transcription and splicing of exons [46]. Meanwhile, CTCF-H3K36me3 elements show increased H3K4me1 and H2AFZ signals (Fig. 4e), both of which are associated with enhanced transcription elongation [51, 52]. Considering a majority of CTCF-H3K36me3 elements are located outside exon regions, we propose that CTCF-H3K36me3 elements promote transcription elongation by serving as binding hubs for various cis- and trans-regulatory elements (Fig. 4f), which are captured by CREaTor for cross-cell type gene regulation modeling.

留言 (0)

沒有登入
gif