TargetRNA3: predicting prokaryotic RNA regulatory targets with machine learning

Features of target interactions

In order to evaluate features indicative of interactions between sRNAs and their regulatory targets, we compiled a set of 4386 sRNA:target interactions for which there is experimental evidence. The 4386 interactions come from 77 sRNAs in 13 different genomes from 4 phyla (Additional file 1: Table S1). For the 77 sRNAs, we also looked at possible targets in their corresponding genomes for which we did not find experimental evidence of interaction. There are 325,162 pairs of sRNAs and possible targets in the 13 genomes without evidence of interaction. We consider these 325,162 pairs as non-interactions. Of course, some of these pairs that we label as non-interacting may indeed be regulatory interactions for which we have not yet found evidence. Thus, the false-positive rates we ultimately report may be over-estimates. Nonetheless, since most sRNAs have regulatory interactions with only a small percentage of all possible targets from their genome, we hypothesize that the number of false-negative labels is relatively modest.

For each of these 329,548 pairs of sRNAs and possible targets, we calculated values for 111 features that may be predictive of sRNA:target interactions. Most of the features have been used in other studies to predict interactions, though a few are new to this study. For instance, 64 of the features correspond to trinucleotide frequency differences as used by sRNARFTarget, and 17 of the features correspond to properties of the IntaRNA-predicted hybridization as suggested by sInterBase [25]. The complete set of 329,548 pairs of sRNAs and possible targets together with each of their 111 feature values is available in Additional file 1: Table S2, and details on the features are provided in Additional file 1: Table S3.
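As an illustration of the trinucleotide features mentioned above, the following minimal sketch computes 64 frequency differences for one sRNA and one candidate target. The use of overlapping windows, the direction of the subtraction (sRNA minus target), and the function names are assumptions made for this example; they are not taken from sRNARFTarget or TargetRNA3.

```python
from itertools import product

# All 64 trinucleotides over the RNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=3)]

def trinucleotide_frequencies(seq):
    """Return the relative frequency of each of the 64 trinucleotides in seq."""
    seq = seq.upper().replace("T", "U")
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows that contain ambiguous bases
            counts[kmer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in TRINUCLEOTIDES]

def trinucleotide_differences(srna, target):
    """64 features: sRNA frequency minus target frequency (assumed direction)."""
    f_srna = trinucleotide_frequencies(srna)
    f_target = trinucleotide_frequencies(target)
    return [a - b for a, b in zip(f_srna, f_target)]

# Hypothetical toy sequences, far shorter than real sRNAs or mRNA regions.
features = trinucleotide_differences("ACGUACGU", "UUUUACGA")
```

Because each frequency vector sums to one, the 64 differences always sum to zero, which is a quick sanity check on any implementation of this feature family.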

We then investigated combinations and subsets of the 111 features as well as the relationship of each feature with interactions and non-interactions (Additional file 1). For each feature, we used analysis of variance (ANOVA) to calculate its F-statistic and corresponding p-value demonstrating the feature’s relationship to whether interactions are evinced or not (Fig. 1) [26]. As Fig. 1 illustrates, some features are not informative in distinguishing interactions from non-interactions. For example, the existence of seed regions of length 8, 9, or 10 base pairs, which are used in several existing prediction tools and which correspond to consecutive base pairs in the sRNA and in the possible target that are perfectly complementary, does not contain substantial predictive power (p-values of 0.57, 0.031, and 0.43, respectively). In contrast, features related to homology appear to be important. Features capturing the conservation of a sRNA and its possible target have significant p-values as do the two features from CopraRNA, a tool which makes heavy use of homology in computing its p-value and false discovery rate. Similarly, features relating to the binding energy of a sRNA and possible target tend to be significant.
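The per-feature ANOVA described above can be sketched as follows for a single feature split into two groups, interactions and non-interactions. This is a plain-Python illustration with made-up values, not the authors' code; the corresponding p-value would additionally require the F distribution (for example via `scipy.stats.f_oneway`), which is omitted here.

```python
def anova_f_statistic(groups):
    """One-way ANOVA F-statistic for a list of groups of numeric values."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical hybridization energies (kcal/mol) for a discriminating feature.
energy_interacting = [-12.0, -10.0, -11.0]
energy_non_interacting = [-5.0, -6.0, -4.0, -5.0]
f_energy = anova_f_statistic([energy_interacting, energy_non_interacting])

# Hypothetical seed-presence flags with identical group means.
f_seed = anova_f_statistic([[1, 0, 1, 0], [0, 1, 0, 1]])
```

A large F-statistic, as for the toy energy feature, means the group means differ by much more than the within-group spread; a value near zero, as for the toy seed feature, means the feature does not separate the two groups.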

Fig. 1

Relationship of features to evinced interactions. The F-statistic and corresponding p-value, as calculated using analysis of variance, are shown for each feature except for the 64 trinucleotide frequency differences. Higher F-statistics and lower p-values (more darkly shaded regions in the figure) indicate how well the feature discriminates interactions from non-interactions. For comparison, the first row shows the F-statistic and p-value for the probabilities reported by TargetRNA3

Based on the significance of different features in distinguishing interactions from non-interactions (Fig. 1) and the efficiency of calculating different features (Additional file 1: Table S3), we selected a subset of nine features that capture the key aspects of separating interactions from non-interactions and that can be computed rapidly. The nine features are shown in Additional file 1: Fig. S3 with their relationship to whether interactions are evinced or not based on ANOVA (Additional file 1: Fig. S3A) and based on correlation coefficient (Additional file 1: Fig. S3B).

Machine learning algorithms

Using our set of 329,548 pairs of sRNAs and possible targets, we explored 8 different machine learning algorithms and evaluated each algorithm for its ability to accurately identify sRNA:target interactions. Once trained, each algorithm reports a probability that any sRNA and possible target genuinely interact. Figure 2 shows the receiver operating characteristic (ROC) curves for the eight machine learning algorithms, indicating the trade-off between sensitivity (i.e., true-positive rate) and specificity (i.e., 1.0 − false-positive rate) at different probability thresholds, and Additional file 1: Table S4 provides additional statistics, including area under the ROC curve, F1 score, and Matthews correlation coefficient, indicating each algorithm’s performance. Based on these results (Fig. 2 and Additional file 1: Table S4), we found that the gradient boosting algorithm was one of the best performing at any threshold and, particularly, at probability thresholds corresponding to very low false-positive rates such as false-positive rates of 0.05 (the left-most region of Fig. 2). Given its robustness at different thresholds, its performance at low false-positive rates that are most relevant to target prediction, and its speed, we selected the gradient boosting algorithm for more careful investigation and as the basis for TargetRNA3.
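To make the ROC analysis concrete, here is a minimal sketch of how a ROC curve and its area can be computed from predicted probabilities and binary labels. The labels and scores are invented for illustration, and the helper names are hypothetical; this is not the evaluation code used for Fig. 2.

```python
def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) points for all thresholds.

    labels: 1 for an interaction, 0 for a non-interaction.
    scores: predicted probabilities; higher means more likely to interact.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Pairs tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and predicted probabilities for six sRNA:target pairs.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
toy_roc = roc_points(toy_labels, toy_scores)
toy_auc = area_under_curve(toy_roc)
```

Each threshold contributes one point, so the left-most part of the curve corresponds to the strictest thresholds, which is the region of interest when few false positives can be tolerated.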

Fig. 2

ROC curves showing the performance of different machine learning algorithms. The performance of 8 machine learning algorithms is illustrated by ROC curves. The abscissa axis corresponds to the false-positive rate, i.e., 1.0 − specificity. The ordinate axis corresponds to the true positive rate, i.e., the recall or sensitivity. Different thresholds for the values reported by an algorithm represent different points along the algorithm’s curve in the figure. The dotted line with unit slope indicates the performance of a naïve random algorithm. For each algorithm, the area under the curve (AUC) is indicated

After identifying gradient boosting as one of the best performing of the eight algorithms that we considered for target prediction, we examined how its performance compared to that of an automated machine learning (AutoML) system, namely auto-sklearn [27]. auto-sklearn is a popular AutoML system that uses meta-learning and Bayesian optimization to select learning algorithms and tune their hyperparameters in a combined search space. Thus, in contrast to a single machine learning algorithm such as gradient boosting, auto-sklearn explores a large set of algorithms, not just individually but also in combination as part of ensembles, while simultaneously optimizing their parameters. Figure 3 shows the ROC curves for gradient boosting, which is used by TargetRNA3, and for both auto-sklearn [27] and auto-sklearn 2.0 [28]. While the AutoML approaches perform better than gradient boosting at most probability thresholds, their performance is comparable to gradient boosting at thresholds corresponding to very low false-positive rates, which are our focus when predicting sRNA:target interactions.

Fig. 3

ROC curves comparing the performance of TargetRNA3 with AutoML. The performance of TargetRNA3 and two AutoML systems, auto-sklearn and auto-sklearn 2.0, is illustrated by ROC curves. The abscissa axis corresponds to the false-positive rate, i.e., 1.0 − specificity. The ordinate axis corresponds to the true-positive rate, i.e., the recall or sensitivity. Different thresholds for the values reported by an algorithm represent different points along the algorithm’s curve in the figure. The dotted line with unit slope indicates the performance of a naïve random algorithm. For each algorithm, the area under the curve (AUC) is indicated

Having examined different machine learning algorithms and their performance, we wanted to better understand the relative contribution of each feature toward distinguishing interactions, so we performed a SHAP (SHapley Additive exPlanations) analysis, which enables global measures of feature importance for a machine learning model [29]. Figure 4 illustrates the impact of different features on TargetRNA3’s predictions based on Shapley values. As indicated in Fig. 4, some features such as the energy of hybridization of the two interacting RNAs as determined by RNAplex (blue values in Fig. 4A for this feature correspond to large negative energies) and the number of sRNA:target homologs (red values in Fig. 4A for this feature correspond to large numbers of homologs) contribute more toward TargetRNA3’s predictions and some features such as whether the stop codon of a target’s upstream gene overlaps the target’s start codon contribute little toward TargetRNA3’s predictions.
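The idea behind Shapley values can be illustrated with a brute-force computation over all feature subsets, which is feasible for a handful of features. The sketch below is an assumption-laden toy: the linear scoring function, its three features, and the single baseline point are hypothetical, and practical SHAP analyses of tree ensembles use far more efficient algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical score from (energy, number of homologs, overlap flag)."""
    energy, homologs, overlap = z
    return 0.1 - 0.02 * energy + 0.01 * homologs + 0.0 * overlap

instance = [-15.0, 20.0, 1.0]   # strong hybridization, many homologs
reference = [-5.0, 2.0, 0.0]    # assumed baseline pair
phi = shapley_values(toy_model, instance, reference)
```

The values in `phi` sum to the difference between the model output for the instance and for the baseline, and the feature with a zero coefficient receives a Shapley value of zero, mirroring how uninformative features contribute little in such an analysis.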

Fig. 4

Contributions of features used by TargetRNA3. The results of SHAP analyses are shown indicating the contributions of features used by TargetRNA3 when making predictions. A For each of the nine features, the feature’s impact on the machine learning model’s output is shown by the distribution of the feature’s Shapley values. B For each of the nine features, the maximum absolute Shapley value over all interactions is indicated

Comparison with other target prediction methods

To assess how TargetRNA3 compares to other approaches for predicting sRNA:target interactions, we evaluated the performance of TargetRNA3, CopraRNA [19], RNAup [17], IntaRNA [18], RNAplex [30], and sRNARFTarget [21]. It is worth noting that CopraRNA has shown some of the best performance in the past at identifying target interactions [15], and sRNARFTarget is a recent approach for target prediction that also employs machine learning and uses a set of features unique among the tools, namely the difference in frequency for each of the 64 trinucleotides between the sRNA sequence and a possible target sequence. Detailed scores reported by each of these six algorithms on all 329,548 pairs in our dataset are provided in Additional file 1: Table S2. Figure 5A illustrates the ROC curves for each of the six algorithms as well as the area under the curve (AUC) for each. While ROC curves show true-positive rate and false-positive rate performance at different thresholds, we are particularly interested in low false-positive rates, so we considered the true-positive rate (i.e., sensitivity) of each of the six algorithms at a specific false-positive rate of 0.05 (Fig. 5B). We also measured the runtime, per sRNA, of each of the six algorithms (Fig. 5C). As shown in Fig. 5 and Additional file 1: Table S5, TargetRNA3 had the best performance overall and, critically, at low false-positive rates. TargetRNA3 has the added benefit of one of the fastest runtimes, which is not accidental, since the runtime of computing a feature was one of the considerations when we selected 9 of the 111 features for TargetRNA3.
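Reading off the sensitivity at a fixed false-positive rate, as in the comparison at 0.05, can be sketched as follows. The function name, the tie handling, and the toy scores are assumptions for illustration only and do not reproduce the values reported for any tool.

```python
def sensitivity_at_fpr(labels, scores, max_fpr=0.05):
    """Highest true-positive rate reachable with false-positive rate <= max_fpr."""
    pos = sum(labels)
    neg = len(labels) - pos
    best = 0.0
    # Lower the threshold step by step; both rates can only grow.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        if fp / neg > max_fpr:
            break
        best = tp / pos
    return best

# Hypothetical data: 4 interactions and 20 non-interactions.
pos_scores = [0.95, 0.9, 0.6, 0.2]
neg_scores = [0.7] + [0.5 - 0.02 * i for i in range(19)]
example_labels = [1] * len(pos_scores) + [0] * len(neg_scores)
example_scores = pos_scores + neg_scores
tpr_at_5_percent = sensitivity_at_fpr(example_labels, example_scores)
```

With 20 non-interactions, a false-positive rate of 0.05 allows exactly one false positive, so the function reports the fraction of true interactions ranked above the second-highest-scoring non-interaction.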

Fig. 5

Performance comparison of TargetRNA3 and existing tools for predicting targets of sRNA regulation. The performance of TargetRNA3 and five existing tools (CopraRNA, RNAup, IntaRNA, sRNARFTarget, and RNAplex) when predicting sRNA targets is shown. A ROC curves for the six tools are illustrated. The abscissa axis corresponds to the false-positive rate, i.e., 1.0 − specificity. The ordinate axis corresponds to the true-positive rate, i.e., the recall or sensitivity. Different thresholds for the values reported by a tool represent different points along the tool’s curve in the figure. The dotted line with unit slope indicates the performance of a naïve random tool. B A particular point along each curve in A is shown, specifically the point at which each of the six curves intersects the vertical line corresponding to a false-positive rate of 0.05. That is, the sensitivity, i.e., recall or true-positive rate, is shown for the six tools when their specificity is 95%, i.e., their false-positive rate is 0.05. C The mean runtime in minutes per sRNA is shown for the six tools, with yellow error bars corresponding to the standard error
