Investigating the overlap of machine learning algorithms in the final results of RNA-seq analysis on gene expression estimation

The FastQC tool was used for the quality control step. 76 samples passed with a quality score above 30, while 95 samples did not meet the desired quality grade. Trimmomatic was used to trim low-quality regions of the reads, and the Rsubread package was used to map the reads to the reference genome. The resulting BAM files were checked for alignment quality and showed a minimum mapping quality of 34, which is sufficient. Quantification of gene expression was performed with Salmon. The training data consisted of the quant.sf file for each of the 129 samples in the Salmon output. We imported the quant.sf files with tximport, scaling transcripts-per-million (TPM) by the average transcript length across samples and the library size (length-scaled TPM), followed by a log2 transformation. The remaining 42 of the 171 samples produced no quantification results because of errors when reading their FASTQ files: the minimum return value for both the left and the right reads was −2, indicating that the files were not valid and therefore could not be quantified. The percentage of aligned reads across samples ranged from 71% to 84%.

The total read count across samples, corresponding to 561,501,702 reads, was obtained with DESeq2. Unexpressed genes were filtered out, yielding reasonable values that show in how many samples each gene is expressed. The resulting table covered 129 samples and 35,135 genes. The data were then normalized and filtered, reducing the effect of library size and removing the dependence of the variance on the mean. Of the 35,135 genes, the 10,796 that were expressed in all 129 samples were kept for further analysis. For these 10,796 genes, automatic filtering was performed based on the average of the normalized measurements for each gene. DESeq2 with the Benjamini–Hochberg (BH) false discovery rate (FDR) correction for multiple hypothesis testing was used to calculate an FDR-adjusted p value for each gene. The study is limited to the 10,796 genes expressed in all samples, because the main goal of the study was to find genes that are important and expressed in all of the cancer types we examined.
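
A minimal sketch of this import and filtering step in R is shown below; the `samples` table (with a `run` column of sample IDs and a `condition` column) and the `tx2gene` transcript-to-gene mapping are assumed objects not shown in the text, and `salmon_out` is a hypothetical path:

```r
library(tximport)
library(DESeq2)

# `samples` and `tx2gene` are assumed objects: a sample table with
# quant.sf paths and a transcript-to-gene mapping, respectively.
files <- file.path("salmon_out", samples$run, "quant.sf")
names(files) <- samples$run

# Scale TPM using the average transcript length across samples and
# the library size, as described in the text ("lengthScaledTPM").
txi <- tximport(files, type = "salmon", tx2gene = tx2gene,
                countsFromAbundance = "lengthScaledTPM")

dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)

# Keep only genes with non-zero counts in all 129 samples.
keep <- rowSums(counts(dds) > 0) == ncol(dds)
dds <- dds[keep, ]
```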

As a third quality-control step and to evaluate the whole experiment, a Principal Component Analysis (PCA) plot of the sample distances was used, as shown in Fig. 1. The PCA plot was produced with DESeq2, which offers the variance stabilizing transformation (VST) for negative binomial data, so that the differences between the normal and tumor samples contribute to the expected variance of the experiment. The graph shows the samples in the 2D plane spanned by the first two principal components, where the first dimension captures the separation of cancer types among the samples and the second dimension captures the separation of tumor samples from normal samples.
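
This step can be sketched with DESeq2 as follows, assuming the `dds` object from the import step:

```r
# Variance stabilizing transformation, then PCA of the samples,
# colored by condition (tumor vs normal).
vsd <- vst(dds, blind = FALSE)
plotPCA(vsd, intgroup = "condition")
```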

Fig. 1

Principal component analysis plot. The plot shows that the differences between the two conditions (tumor and normal) are significant. Samples from healthy individuals lie close to one another relative to the cancer samples, and some separation is also observed between the lung cancer samples and the other cancer types

DESeq2 was used for differential gene expression analysis in R. Dispersion estimation and the Wald statistic were computed: a negative binomial model was fitted for each gene, and nbinomWaldTest was used to test for differential expression. 10,796 genes showed a change in gene expression between samples; after filtering at p value < 0.0001, recovering the normalized measurements, and comparing tumor versus normal samples, 4,559 genes were found to be the most important differentially expressed genes. Ensembl transcript names were converted into gene symbols using the AnnotationDbi package. To visualize the most important differentially expressed genes, a volcano plot was created, showing the change in expression between the two conditions. The volcano plot, as shown in Fig. 2, is a type of scatter plot of statistical significance (p value) versus fold change, which allows fast visual recognition of genes with statistically significant changes.
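
A sketch of this workflow is given below; the contrast level names and the use of org.Hs.eg.db are assumptions, since the text only names DESeq2, nbinomWaldTest and AnnotationDbi:

```r
library(AnnotationDbi)
library(org.Hs.eg.db)   # assumed human annotation package

dds <- DESeq(dds)   # size factors, dispersion estimation, nbinomWaldTest

# Tumor vs normal contrast; the level names are assumptions.
res <- results(dds, contrast = c("condition", "tumor", "normal"))

# Keep the most significant genes (p < 0.0001, as in the text).
sig <- res[!is.na(res$pvalue) & res$pvalue < 1e-4, ]

# Convert Ensembl identifiers to gene symbols.
sig$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sig),
                     keytype = "ENSEMBL", column = "SYMBOL")
```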

Fig. 2

Volcano plot of the 4,559 most differentially expressed genes. The most upregulated genes are to the right and the most downregulated genes are to the left; the most statistically significant genes are toward the top. Ten of the most important differentially expressed genes are KCTD20, ZNF185, VCL, ITGB1, F13A1, TPP1, EIF4G2, PRKAR1A, ABCC3 and CORO1C

Functional enrichment analysis was performed with the gProfileR package at an adjusted p value < 0.0001. Significantly enriched GO terms were identified, of which 135 were overrepresented, classified into 94 Biological Process (BP), 11 Molecular Function (MF) and 30 Cellular Component (CC) terms, respectively. The most important GO term in frequency and uniqueness was GO:0005515, an MF term named protein binding, with a frequency of 76.4%. Across all GO terms, the most important upregulated genes are KCTD20, ZNF185, VCL, ITGB1, F13A1, TPP1, EIF4G2, PRKAR1A and CORO1C, and the most important downregulated genes are GPNMB, ZNF835 and MARN22. Gene set enrichment analysis with the gage package was then performed. A total of 156 KEGG pathways were identified at an adjusted p value < 0.0001, of which 7 were upregulated and 149 were downregulated. The differentially expressed genes clustered in the following pathways: hsa00190 oxidative phosphorylation, hsa04145 phagosome, hsa04810 regulation of actin cytoskeleton, hsa04510 focal adhesion, hsa04670 leukocyte transendothelial migration and hsa04144 endocytosis. Across all pathways, the most important upregulated genes are ITGB1, VCL, CORO1C, ABCC3, F2R, ACTN1, CDC42, GRB2, EHD3 and NDUFA4, and the most important downregulated genes are NDUFV3, CORO2A, TLR6, SL LDLR. GO enrichment analysis revealed that the predicted regulatory gene group was enriched with genes involved in the differential expression analysis.
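
The gage step can be sketched as follows; the `entrez` column of Entrez IDs mapped onto the DESeq2 results is an assumption not shown in the text:

```r
library(gage)
data(kegg.gs)   # KEGG gene sets, keyed by Entrez IDs

# `res$entrez` is an assumed column of Entrez IDs for the results.
foldchanges <- res$log2FoldChange
names(foldchanges) <- res$entrez

kegg_res <- gage(foldchanges, gsets = kegg.gs, same.dir = TRUE)

head(kegg_res$greater)   # pathways enriched among upregulated genes
head(kegg_res$less)      # pathways enriched among downregulated genes
```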

Subsequently, two supervised learning algorithms were tested. These two algorithms were chosen after evaluating many alternatives. It is common to explore multiple algorithms and techniques during RNA-seq analysis to determine which provide the best results for a specific research question and data set; feature engineering, data preprocessing and cross-validation also play critical roles in the success of machine learning in RNA-seq analysis. Based on these results, the Random Forest and Gradient Boosting algorithms were found to be suitable and to fit our research objective, because we prioritized robustness, feature importance analysis (both algorithms provide feature importance scores) and high predictive accuracy.

We used Random Forest because it is a powerful, easy-to-implement model that handles high-dimensional data and provides feature importance scores for gene selection. It is robust to outliers and noisy data. This is beneficial in RNA-seq analysis, where gene expression data can have variations and technical noise. It can be used for binary or multiclass classification of genes based on their expression patterns in different cancer types.

We used the Gradient Boosting algorithm because it offers the high prediction accuracy that is critical in RNA-seq analysis and can be used for accurate gene classification and feature selection, helping us identify differentially expressed genes. It handles noisy data by iteratively improving predictions and reducing the impact of outliers. Moreover, when an RNA-seq dataset has a limited number of samples, Gradient Boosting can still perform well due to its iterative nature and its focus on correcting misclassifications.

Before training, exploratory data analysis was performed to see how the variables and samples related to each other. First, we normalized and transformed the data, addressing scaling issues that can arise from how the experiments were run and from potential problems during data collection. We then filtered the predictor variables, selecting arbitrary cutoffs for variability; the expression values of the initial 35,135 genes were used. The caret::preProcess function was used to filter the predictor variables, the 1,000 best predictors (gene expression values) were selected, and highly correlated predictors were then filtered out: if two variables are sufficiently correlated, one of them is removed. The classification problem was binary, tumor samples versus normal samples. Of the 129 samples, 108 were cancer samples (30 breast, 10 liver, 19 colorectal, 21 lung, 18 pancreatic, 10 glioblastoma) and 21 were normal samples. The data were split into training and test sets with caret::createDataPartition with the parameter p = 0.7, i.e. a 70:30 training-to-test ratio, corresponding to 91 training samples and 38 test samples. A sketch of these steps is shown below.
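
In the sketch below, `expr` (a samples-by-genes matrix of normalized, transformed expression values) and `labels` (a factor with levels "tumor" and "normal") are assumed names:

```r
library(caret)

# Keep the 1,000 most variable genes (an arbitrary variability cutoff).
vars <- apply(expr, 2, var)
expr <- expr[, order(vars, decreasing = TRUE)[1:1000]]

# Remove near-zero-variance and highly correlated predictors.
pp <- preProcess(expr, method = c("nzv", "corr"))
expr_f <- predict(pp, expr)

set.seed(17)
idx <- createDataPartition(labels, p = 0.7, list = FALSE)  # 70:30 split
train_x <- expr_f[idx, ];  train_y <- labels[idx]
test_x  <- expr_f[-idx, ]; test_y  <- labels[-idx]
```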

Configuring the Random Forest and Gradient Boosting algorithms involved setting several hyperparameters, listed below for each algorithm.

Random Forest

1. n_estimators: the number of decision trees in the forest. We used the "ranger" method and the tuneGrid argument of the train function, which specifies a grid of parameters.

2. mtry: the number of variables randomly selected at each split in each tree. It was set to 100 as part of the hyperparameter tuning process.

3. Criterion: Random Forest can use different criteria for splitting nodes in the decision trees. splitrule was set to "gini", indicating that the Gini impurity criterion was used.

4. min_samples_leaf: the minimum number of samples required at a leaf node. In our code, min.node.size was set to 1.

5. bootstrap: a Boolean parameter indicating whether to use bootstrapping when building trees. Bootstrapping introduces randomness into the model, which can reduce overfitting. We did not set bootstrap explicitly, so the default value (TRUE) was used.

6. random_state: controls the random seed for reproducibility. We set the seed to 17 using set.seed(17).

A sketch of the corresponding training call is shown after this list.
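
The following sketch shows a caret call with the ranger engine and the hyperparameters above; the resampling settings and the importance mode are assumptions, since the text does not state them:

```r
set.seed(17)
rf_fit <- train(
  x = train_x, y = train_y,
  method = "ranger",
  importance = "permutation",        # needed so varImp() can rank genes
  tuneGrid = data.frame(mtry = 100,
                        splitrule = "gini",
                        min.node.size = 1),
  trControl = trainControl(method = "cv", number = 5)  # assumed resampling
)

plot(varImp(rf_fit), top = 20)   # most important genes
```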

Gradient Boosting

1. n_estimators: the number of boosting stages (trees). In our code we set nrounds = 200 in the tuneGrid, which corresponds to the number of trees in the ensemble.

2. learning_rate: the step size at each iteration while moving toward a minimum of the loss function. The code specifies a range of values for the learning rate (0.05, 0.1, 0.3), which is tuned during the cross-validation process.

3. max_depth: the maximum depth of individual trees. The code sets max_depth = 4.

4. gamma: a regularization parameter that controls the complexity of individual trees. We set gamma = 0.

5. subsample: set to 0.5, meaning a random fraction of the samples is used for fitting each tree.

6. min_child_weight: the minimum sum of instance weights (Hessian) needed in a child, a regularization parameter. We set min_child_weight = 1.

7. random_state: controls the random seed for reproducibility, just as in Random Forest. We set the seed to 17 using set.seed(17).

A sketch of the corresponding training call is shown after this list.
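
The following sketch uses caret with the xgboost engine ("xgbTree") and the hyperparameters above; colsample_bytree = 1 and the resampling settings are assumptions (caret's xgbTree grid requires colsample_bytree, but the text does not mention it):

```r
set.seed(17)
gb_grid <- expand.grid(
  nrounds = 200,
  eta = c(0.05, 0.1, 0.3),   # learning rate values tuned by cross-validation
  max_depth = 4,
  gamma = 0,
  colsample_bytree = 1,      # assumed; required by caret's xgbTree grid
  min_child_weight = 1,
  subsample = 0.5
)

gb_fit <- train(
  x = train_x, y = train_y,
  method = "xgbTree",
  tuneGrid = gb_grid,
  trControl = trainControl(method = "cv", number = 5)  # assumed resampling
)

plot(varImp(gb_fit), top = 20)
```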

Based on our data, we tuned these hyperparameters using techniques such as grid search and random search, testing various combinations to find the best set for our data. Variable importance was also calculated and plotted.

More specifically, the Random Forest algorithm achieved an accuracy of 100% on the training set and 84.21% on the test set, and the Gradient Boosting algorithm achieved 98.9% on the training set and 86.8% on the test set.
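
Held-out accuracy of this kind can be computed along these lines, using the fitted models from the sketches above:

```r
# Accuracy on the held-out test set for each fitted model.
rf_pred <- predict(rf_fit, newdata = test_x)
confusionMatrix(rf_pred, test_y)$overall["Accuracy"]

gb_pred <- predict(gb_fit, newdata = test_x)
confusionMatrix(gb_pred, test_y)$overall["Accuracy"]
```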

The experimental results indicate that both classifiers performed well. Most importantly, the important variables identified by the Random Forest algorithm (Fig. 3) and the Gradient Boosting algorithm (Fig. 4) were examined, and they overlap with the most important genes from the differential expression and functional enrichment results (GO terms and KEGG pathways). The genes common to these analyses were VCL, F13A1 and ACTN1.
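
The overlap check could be sketched as follows; the top-50 cutoff is an arbitrary illustration rather than the cutoff used in the study, and all three gene lists are assumed to share the same identifiers:

```r
# Top-ranked genes from each model, intersected with the
# differentially expressed genes from the DESeq2 sketch (`sig`).
top_genes <- function(fit, n = 50) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][seq_len(n)]
}

Reduce(intersect, list(top_genes(rf_fit), top_genes(gb_fit), rownames(sig)))
```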

Fig. 3

Random Forest variable importance plot. The plot indicates the overlap of the Random Forest algorithm's most important variables (gene identifiers) with the results of the differential expression analysis

Fig. 4

Extreme Gradient Boosting variable importance plot. The plot indicates the overlap of the Gradient Boosting algorithm's most important variables (gene identifiers) with the results of the differential expression analysis
