Machine learning-based analysis of cancer cell-derived vesicular proteins revealed significant tumor-specificity and predictive potential of extracellular vesicles for cell invasion and proliferation – A meta-analysis

Machine learning methods revealed tumor-specific protein patterns of EV proteomeShared proteins of EVs are related to EV biogenesis processes

The proteomic data set of the 60 EV samples contained 6,071 proteins. Intensity was measured for 5,908 proteins, referred to as the entire proteome in this study.

According to Gene Ontology Enrichment Analysis, the entire proteome is significantly associated with biological processes, molecular functions and cellular compartments such as neutrophil-mediated immunity, cell adhesion to the extracellular matrix, secretory vesicles and granules (Additional file 1). The fold enrichment values—which indicates how drastically genes of a certain pathway are overrepresented—ranged between 1.68 and 3.01. This means that we identified at least 1.68 times more proteins from the listed signal pathways as it would have been expected by chance.

Of the 5,908 proteins, 213 were present in all EV samples, referred to as the core proteome. The enrichment analysis of the core proteome showed that the shared proteins are involved in intracellular and EV biogenesis pathways, such as cotranslational protein targeting to membrane, RNA binding and cytosolic ribosomes (Additional file 2). Association of the core proteome with each biological pathway showed higher significance than the entire proteome, which was reflected in the fold enrichment values ranging from 3.78 to 33.12.

Entire proteome of EVs resulted higher classification accuracy of tumor cell lines than core proteome

We first inspected the core proteome for tumor-specific patterns using the logistic regression classification model.

Remarkably, even this small subset of the entire proteome affecting a few biological processes carried enough specific information to distinguish certain tumor types from the others to some extent, such as kidney, lung, leukemia and melanoma (Fig. 1a, c). The classification accuracy of 49.14% significantly outperformed the 11.1% that would have been obtained with random classification.

Fig. 1figure 1

Classification efficiency based on the core and entire proteome. a t-SNE plot of the core proteome. b t-SNE plot of the entire proteome. The dots with different colors represent the 60 individual EV samples belonging to the nine tumor types. The color gradient in the plot indicates the dot density. c Confusion matrix of the classification results using the core proteome. d Confusion matrix of the classification results using the entire proteome. Each row of the matrices represents the instances in an actual class while each column represents the instances in a predicted class. Diagonally, the percentage of the correct classification is shown in blue. The percentage of errors is indicated in red

As expected, a one-way ANOVA analysis revealed that the average intensity of the core proteome depends on tumor type (p < 0.0001). However, Pearson’s correlation analyses confirmed that this difference could not be caused by differences in EV secretion, EV mean and mode size, or cell size. No significant correlation was identified between any parameter and the average intensity of the core proteome. This suggests that the unique core proteome pattern is not caused by the difference in EV production rate and type of EVs between the nine tumor types, but the different tissue origin.

Using the entire proteome, the distinction between tumor types had become even more defined (Fig. 1b, d). Classification accuracy significantly increased for CNS, colon, leukemia, lung, melanoma, and ovary. The average classification accuracy increased to 69.10% which is 57.99% higher than chance.

The EV proteome could be used to form a discriminative protein panel

In exploring the discriminatory protein panel, we have taken care to ensure that the method does not become overestimated or overfitted. To achieve this, the 60 cell lines were split 50–50%. On one half of the cell lines, the Train set, we applied the LASSO algorithm.

Using the LASSO method, we were able to assign importance scores to each protein of the entire proteome based on their ability to differentiate the nine tumor types in the Train set. The selection algorithm (with parameter C = 1) resulted in 172 proteins, which were further investigated for hierarchical clustering, classification purposes and Reactome pathway analysis (Additional file 3).

In the hierarchical clustering, the Train and Test sets were analyzed together on the basis of 172 proteins.

Hierarchical clustering using a heatmap revealed that the 172 proteins form a well-defined pattern, enabling the 60 EV samples to form nearly perfectly homogenous clusters, while the Train and Test sets elements are clustered together (Fig. 2a).

Fig. 2figure 2

Classification efficiency for the selected proteins. a Heatmap with hierarchical clustering. In the heatmap, the columns and rows represent the 60 EV samples belonging to the nine tumor types marked with different colors and the 172 proteins, respectively. Both the columns and rows are clustered. Dendrogram branches ending in a square indicate the elements to be included in the Train set. b t-SNE plot of the selected 172 proteins. The dots with different colors represent the 60 individual EV samples belonging to the nine tumor types. In the plot, the color gradient indicates the dot density. c Confusion matrix of the classification results using the selected proteins on the Test set. Each row of the matrices represents the instances in an actual class while each column represents the instances in a predicted class. Diagonally, the percentage of the correct classification is shown in blue. The percentage of errors is indicated in red

This separation is also evident in the t-SNE plots, which depict the various tumor types as distinct groups (Fig. 2b). Again, the elements of the Train and Test sets populated the same areas.

When the samples of the Test set were classified based on the 172 proteins, an average classification efficiency of 91.67% was achieved (Fig. 2c).

For the whole data set (Train + Test), the average efficiency was 96.60%.

Discriminative proteins might uncover tumor-specific pathways

After selecting the proteins, we hypothesized that – given the proteins' large intergroup differences – the biological signaling pathways they affect would also exhibit distinctive patterns. In order to place the 172 selected proteins in a biological context Reactome enrichment analysis was utilized. Only those pathways with p < 0.05 were considered for hierarchical clustering and heatmap creation (Fig. 3).

Fig. 3figure 3

Biological signaling pathways affected by the 172 selected proteins of the discriminative protein panel. The columns marked with different colors represent the 60 EV samples, while the rows indicate the various signaling pathways. Both the 60 samples and pathways were clustered hierarchically. The heatmap values represent the average intensity of the proteins that are part of a given signal pathway. The gray barplots next to the names of the pathways indicate the -log10(p value). In all instances, p < 0.05. (agg.: aggregation; biosynth.: biosynthesis; cotrans.: cotransporters; deacet.: deacetylate; form.: formation; mod.: modifying; org.: organization; phosph.: phosphorylation; prots.: proteoglycans; sig.: signaling; trans.: transcription; transl.: translocation)

The selected 172 proteins are associated with extracellular matrix, nuclear processes, and cell division-related signaling pathways.

Although cancers of the breast and prostate lacked characteristic signaling pathways, the majority of the EV samples clustered according to their tumor type revealing a distinctive signaling pathway pattern.

The collagen matrix, TGF-β receptor, and ERB4 enzyme signaling pathways were identified as common characteristics for both kidney and central nervous system tumors, which clustered together.

Compared to other tumors, leukemia samples exhibit a predominance of nuclear processes associated with histone and chromatin modification.

In general, lung tumors were distinguished by platelet-associated biological processes and integrin-signaling pathways.

Extracellular vesicles carry information on invasion and proliferation capacity

The NCI-60 cell line panel contains not only tumors of different tissue origin, but also tumors with different invasion capacities and different division rates.

Noting that tumor cell lines with low invasion capacity such as BT549 and Hs 578 T (breast) were classified into tumors with high invasion capacity (e.g. CNS) during classification and hierarchical clustering the question arose whether further protein panels predicting invasion and proliferation capacity could be defined.

To construct a panel correlated with invasion and proliferation capacity, multiple linear regression with LASSO selection method was utilized.

As in the classification procedure, the data set was split 50–50%. On the Train set, we used LASSO to identify proteins that could be predictive for invasion capacity and proliferation, then validated the findings on the Test set.

The selection resulted in 20 and 15 proteins, which tended to have predictive potential for invasion and proliferation capacity in the Train set, respectively (invasion panel and proliferation panel).

The Test set was then used to validate the predictive value of the panels using multiple linear regression.

Multiple linear regression showing significant results for both the invasion panel and the proliferation panel (p < 0.0001), we also obtained remarkably high coefficients of determination: R2 = 0.68 for the invasion, R2 = 0.62 for the proliferation capacity (Fig. 4). Pooling the Test and Train sets, the R2 values were found to be 0.71 and 0.69, respectively.

Fig. 4figure 4

Results of the multiple linear regression. a Multiple linear regression of invasion capacity. The invasion capacity predicted by the invasion panel for each sample in the Test set is plotted on the x-axis, while the actual invasion capacity is plotted on the y-axis. b Multiple linear regression of proliferation capacity. The doubling time predicted by the invasion panel for each sample in the Test set is plotted on the x-axis, while the actual doubling time is plotted on the y-axis. (R2—coefficient of determination; p—p value.)

After validation on the Test set confirmed the predictive value of the proteins, both of the 20- and 15-member panels (Additional file 4) were then subjected to hierarchical clustering, which resulted in 2–2 clusters (Fig. 5): one cluster that appears to be negatively correlated and another that appears to be positively correlated with invasion or proliferation capacity.

Fig. 5figure 5

Predictive proteins for invasion and proliferation capacity. a Predictive protein panel for invasion capacity (invasion panel). The columns marked with different colors and the gray barplots indicate the 54 EV samples with the invasion capacity measured for the cell line of origin (leukemia not included). The rows indicate the proteins, which were clustered hierarchically. Two defined clusters were separated from each other. b Predictive protein panel for proliferation capacity (proliferation panel). The columns marked with different colors and the gray barplots indicate the 60 EV samples with the doubling time (in hours) measured for the cell line of origin. The rows indicate the proteins, which were clustered hierarchically. Two defined clusters were separated from each other. It should be noted that higher doubling time means lower proliferation capacity as it indicates more time for cell division

Of the 20-member invasion panel, eight proteins (CAV2, DNAJB4, THY1, OXTR, VCAN, COL11A1, EDIL3, CRYAB) positively predicted the invasion capacity of the cell lines. Based on Reactome pathway analysis, these proteins were significantly associated with signaling pathways that upregulate tumor cell maintenance, invasion and binding to the extracellular matrix. Similarly, the enrichment analysis of the remaining twelve proteins that negatively predict invasion capacity was consistent with the regression results: these proteins play a role in pathways that negatively regulate the invasion (Fig. 5a).

The eight proteins that positively influence proliferation capacity were associated with processes linked to cell cycle. While seven proteins negatively associated with proliferation are linked to metabolic pathways (Fig. 5b).

We further attempted to gain more support for our invasion and proliferation capacity prediction panels by examining their impact on patients’ survival time.

The Human Protein Atlas (HPA) was considered an appropriate database for this purpose, as it contains survival times for a large number of cancer patients for all nine cancer types and is easily accessible. However, we had to take into account the limitation that HPA contains tissue RNA expression data and not EV proteomic data.

Accordingly, before utilizing the HPA database, we had to assess the similarity of EV protein and cellular RNA patterns to be permitted to investigate the effect of in vivo RNA tissue expression of panel members on survival time.

First, we examined how the EV protein panels (invasion and proliferation) and the cellular RNAs correlate with each other (Fig. 6). Based on the results, the RNA and protein patterns of the invasion panel showed a moderately strong concordance (RV = 0.51, p = 0.020). While a weaker but still significant relationship was observed when comparing the RNA and protein matrices of the proliferation panel (RV = 0.39, p = 0.048). Notably, we observed stronger pairwise correlations between protein and RNA content for the promoting members of both panels.

Fig. 6figure 6

Correlation of EV protein and cellular RNA content. The heatmaps show the correlation between cellular RNAs and EV proteins of invasion (a) and proliferation (b) panel members. Columns represent the cellular RNA, rows represent the EV proteins

After assessing the relationship between EV protein and cellular RNA pattern, we attempted to use the cellular RNA to estimate the invasion and proliferation capacity of cells using the panel members.

Based on the cellular RNA, invasion capacity could be estimated at R2 = 0.77 (p < 0.0001) and proliferation capacity at R2 = 0.32 (p = 0.037).

The in vitro data suggested that the EV proteomic and cellular RNA patterns are in concordance and that the cellular RNA content is also related to invasion and proliferation capacity in a similar way as the EV proteome. This prompted us to investigate the impact of in vivo RNA tissue expression of panel members on patient survival.

Using the HPA database, we collected clinical data on the tissue expression of our panel members in the nine tumor types from 4,665 patients, then examined the relationship between tissue expression and 5-year survival rate.

In the HPA database, tissue expression was found for 19 of the 20 proteins of the invasion panel (Additional file 5).

According to the HPA, high expression of CAV2, COL11A1, DNAJB4, THY1 and VCAN decreased the 5-year survival for breast, CNS, colon, kidney, lung and ovarian tumors (Fig. 7a). These findings are in line with our results, as these proteins were found to be positively associated with invasion capacity according to multiple linear regression analysis.

Fig. 7figure 7

Survival functions for different expression levels of DNAJB4, CAPN7, DSG2, ECH1. The figure shows 4 exemplary proteins selected from the members of the invasion and proliferation panel and their impact on patients’ survival. a DNAJB4, which we found to be positively associated with invasion and which the Human Protein Atlas (HPA) suggests that its high expression is associated with a worse prognosis in kidney tumors (n = 877). b CAPN7 protein, which in our study is negatively associated with invasion and which the HPA suggests may be associated with a favorable prognosis in kidney tumors. c DSG2 protein which in our study positively predicted the proliferation capacity is a negative prognostic factor in CNS tumors, based on HPA. d Based on our results, ECH1 protein negatively predicted the proliferation capacity, and it is a favorable prognostic marker for CNS tumors

The CRYAB protein was found to be controversial, as our results showed a positive association with invasion, but in HPA, high tissue expression was associated with a better prognosis in CNS tumors. Nevertheless, in colon tumors, high expression was a negative prognostic marker.

The case is similar for EDIL3, which is positively associated with invasion capacity according to multiple linear regression analysis, but based on the HPA, higher tissue expression is associated with better 5-year survival in colon tumors. However, it still was a significantly worse prognostic marker in breast, kidney and melanoma patients.

Overall, the effects on survival found in the HPA database and the effect of the proteins on invasion capacity as determined in our study were consistent in 90% of the cases.

Based on multiple linear regression, twelve proteins in our study were found to be negatively correlated with invasion capacity. Comparing this finding to the HPA database, we found more inconsistencies: according to the HPA, the twelve proteins are favored prognostic markers for 5-year survival in most cases (73.18%) (Fig. 7b), but in 26.82%, the proteins have an adverse effect on survival than the expected. For example, HIST1H3A showed a negative association with invasiveness in our study, but its high expression negatively affected the survival rate of CNS tumor patients according to the HPA database (Additional file 5).

Tissue expression was found for all the 15 proteins of the proliferation panel (Additional file 6). The proliferation panel contains seven proteins which were found to negatively predict the proliferation capacity. According to HPA, high tissue expression of these seven proteins significantly increased the 5-year survival in 64.71% of cases (Fig. 7c). Vice versa, the high expression of the eight proteins which positively predict the proliferation capacity significantly reduces the 5-year survival in 72.41% of cases (Fig. 7d).

Taken as a whole, the EV proteome and in vitro cellular RNA pattern of the panel members showed concordance, and the effect of in vivo tissue RNA expression of the panel members on patient survival is consistent with the results of our linear regression model. The finding potentially suggests the involvement of invasion and proliferation panels in the tumorous processes.

It is noteworthy that the inconsistency with HPA appears for those variables where the in vitro EV proteome and cellular RNA pattern did not show a strong correlation (invasion capacity inhibitory members) (Fig. 4), or cellular RNA did not prove to be a sufficient predictor (overall the proliferation panel).

Comments (0)

No login
gif