Gene co-expression network construction and analysis for identification of genetic biomarkers associated with glioblastoma multiforme using topological findings

Data collection and preprocessing

GBM datasets were downloaded from TCGA (https://portal.gdc.cancer.gov/). The TCGA repository is a rich source of multi-omics data covering varied genomic profiles. The DNA methylation profile of human glioblastoma samples was obtained from the GDC portal. TCGA DNA methylation data from GBM patients include epigenetic markers that help explain suspected regulatory roles in disease progression [16]. The GBM datasets obtained from TCGA required filtering to remove low-quality data points and outliers [7]. Pre-processing steps were also necessary, namely cleaning to remove data inconsistencies, data transformation, and data reduction, as required to conduct the survival analysis [17].

Methodology

In the first step of the implementation, preprocessing is performed on the GBM dataset, which consists of a gene identifier and an expression value for each gene for 76 GBM patients, along with their survival information. Genomic data are produced at a rate of 10 terabytes a day and require complicated processing to transform massive amounts of noisy raw data into biological information [18]. End-to-end processing of genomic data, which includes data alignment, variation discovery, and deep analysis, is essential. In this study, filtering is likewise integrated into the preprocessing phase to prepare the data for the techniques used to identify genes associated with survival outcome [19].

Survival analysis is then used to identify genes associated with the cancer patient’s overall survival. Non-parametric, semi-parametric, and parametric methods of survival analysis were studied and compared to select the one to apply. The Cox proportional hazards regression model is used to identify possible factors associated with patients’ overall survival [20]. Overall survival (OS) is defined as the time from the date of diagnosis to the date of death or last follow-up. Throughout the study we assume that disease progression is reflected by earlier death: a patient died earlier because of rapid progression of the disease driven by the penetrating genes (genetic biomarkers) associated with it. We therefore explored the penetrating genes for death and defined them as disease-progressive genes. The Cox model is used to evaluate the effect of these factors and to examine how a genetic marker influences the rate at which a particular event (e.g., death) occurs at a specific point in time. This rate is termed the hazard rate. The influencing factors are called covariates in the survival-analysis literature. The Cox proportional hazards regression model is applied to the DNA methylation data. Out of 24,925 genes, 156 are identified as significantly associated with the patient’s overall survival (p value ≤ 0.01), as shown in Fig. 1. Gene co-expression networks are constructed from these 156 genes extracted from the DNA methylation dataset [21].
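For reference, the standard form of the Cox proportional hazards model can be written as follows. This is a sketch under assumptions not stated in the text: a single covariate x per model (for example, one gene's methylation value, fitted one gene at a time) and no tied event times. The symbols h_0, β, δ_i, and R(t_i) are notation chosen here for illustration.

```latex
% Hazard for a patient with covariate value x; h_0(t) is the baseline hazard
h(t \mid x) = h_0(t)\,\exp(\beta x)

% Hazard ratio for a one-unit increase in x (constant over time)
\mathrm{HR} = \frac{h(t \mid x + 1)}{h(t \mid x)} = e^{\beta}

% Partial likelihood maximized to estimate \beta (no tied event times):
% \delta_i = 1 if patient i died, 0 if censored at last follow-up;
% R(t_i) is the set of patients still at risk just before time t_i
L(\beta) = \prod_{i:\,\delta_i = 1}
           \frac{\exp(\beta x_i)}{\sum_{j \in R(t_i)} \exp(\beta x_j)}
```

Under this formulation, a gene would be retained when the test of β = 0 gives a p value at or below the chosen cutoff.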

Fig. 1

Workflow of the methodology used for identification of genetic biomarkers. a Data pre-processing steps. b Survival analysis. c Gene co-expression network construction steps. d Gene co-expression network analysis

In the third step, the gene co-expression network is constructed using three correlation measures: Pearson, Kendall, and Spearman. A correlation measure is used to establish a link between two significant genes when constructing the network [22]. A threshold value is selected arbitrarily so that only gene pairs with at least moderate correlation are represented in the network. Pearson correlation is ultimately used for network construction. In the fourth step, the constructed gene co-expression network is analyzed using structural and topological properties of the network, namely degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality. Set operations such as intersection, union, and difference are then applied to identify the most influential genes associated with GBM progression. The detailed method is shown in Fig. 1.
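The three correlation measures named above differ in what they capture: Pearson measures linear association, while Spearman and Kendall are rank-based. The following is a minimal standard-library Python sketch for illustration only; the study itself used R, and the function names and the toy vectors `x` and `y` are assumptions made here, not data from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both terms
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom

# Made-up, monotone but non-linear relationship between two "genes"
x = [1, 2, 3, 4]
y = [1, 8, 27, 64]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
r_kendall = kendall_tau_b(x, y)
```

On this toy input the two rank-based measures reach their maximum because the relationship is perfectly monotone, whereas the Pearson value is lower because the relationship is not linear.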

Construction of the gene co-expression network

A set of co-expressed genes produces proteins. The correlation among genes expressed under different biological conditions is captured by gene co-expression networks [23]. Such a network is represented as an undirected graph G = (V, E), where V is the set of nodes representing genes and E is the set of edges, each connecting two genes that are significantly co-expressed. An edge connecting a pair of nodes indicates that the corresponding genes have significantly similar expression patterns, which in turn indicates that the genes are active under the same biological conditions. Fig. 2 shows a gene co-expression network consisting of 6 genes (G1, G2, …, G6). A connection between two genes shows that they have similar expression patterns, that is, a correlation above the threshold. Co-expressed genes are important from a biological point of view because they are controlled by the same transcriptional regulatory program, show functional relations, or are members of the same pathway or protein complex.

Fig. 2

Gene co-expression network comprising 6 genes. A connection between two genes indicates that they are significantly correlated

Many methods have been developed for constructing gene co-expression networks [24, 25], and they basically follow a two-step approach. In the first step, a similarity score is calculated for every pair of genes using an appropriate co-expression measure. In the second step, a pair of genes is linked by an edge if its correlation score exceeds the selected threshold, which indicates a significant co-expression relationship. The input to network construction is provided in matrix form: an m × n matrix represents m genes and n samples. The stepwise process of gene co-expression network construction is shown in Table 1. To keep the illustration simple, the process is shown for the top five of the 156 genes (p value < 0.01) obtained through survival analysis. The same steps are followed to construct the gene co-expression network with all 156 genes. Step 1(a) shows the survival outcome associated genes identified through Cox regression analysis, with their expression values. Step 1(b) presents the correlation value (to two decimal places) between every pair of genes, calculated using the Pearson correlation measure. The network adjacency matrix shown in step 1(c) is obtained using an arbitrary threshold value of 0.5: a correlation value above 0.5 is indicated by ‘1’, otherwise by ‘0’. Figure 3 shows the gene co-expression network constructed from the adjacency matrix of step 1(c). A value of “1” in the adjacency matrix represents a link (correlation) between two nodes (genes), and “0” represents the absence of a link, indicating that no significant correlation exists between the pair of genes. For better visualization, a subset of the dataset (25 of the 156 survival outcome associated genes) is selected to represent the gene co-expression network [26, 27].
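The two-step construction described above (pairwise Pearson correlation, then thresholding at 0.5) can be sketched as follows. This is a minimal illustration in standard-library Python, not the authors' R code. The expression values in `toy_expr` are made up and are not the values of Table 1, and leaving the diagonal of the adjacency matrix at 0 (no self-loops) is an assumption made here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def coexpression_adjacency(expr, threshold=0.5):
    """Build correlation and adjacency matrices from an expression table.

    expr maps a gene name to its list of values across the n samples.
    Returns (genes, corr, adj): the gene order, the m x m correlation
    matrix rounded to two decimals, and the m x m 0/1 adjacency matrix.
    """
    genes = list(expr)
    m = len(genes)
    corr = [[1.0] * m for _ in range(m)]
    adj = [[0] * m for _ in range(m)]  # diagonal stays 0: no self-loops (assumption)
    for i in range(m):
        for j in range(i + 1, m):
            r = pearson(expr[genes[i]], expr[genes[j]])
            corr[i][j] = corr[j][i] = round(r, 2)
            # Edge only when the correlation is above the threshold, as in the
            # text; negative correlations therefore never produce an edge.
            if r > threshold:
                adj[i][j] = adj[j][i] = 1
    return genes, corr, adj

# Hypothetical expression values: 4 genes (rows) x 5 samples (columns)
toy_expr = {
    "G1": [0.10, 0.40, 0.35, 0.80, 0.90],
    "G2": [0.15, 0.38, 0.30, 0.75, 0.95],
    "G3": [0.90, 0.70, 0.65, 0.20, 0.10],
    "G4": [0.50, 0.10, 0.90, 0.30, 0.60],
}
genes, corr, adj = coexpression_adjacency(toy_expr)
```

Here `adj[i][j] == 1` means that an edge is drawn between `genes[i]` and `genes[j]`. Thresholding the absolute correlation instead would also link strongly anti-correlated genes, which is a different design choice from the one described in the text.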

Table 1 Gene co-expression network construction steps: (a) five genes with their expression values; (b) correlation matrix showing the correlation among the five genes; (c) network adjacency matrix in which a correlation above the threshold (> 0.5) is represented as ‘1’ and otherwise as ‘0’

Fig. 3

Gene co-expression network based on adjacency matrix shown in Table 1(c)

Overall analysis

A gene co-expression network can be analyzed using different network-based measures [28]. Centrality measures are an important tool in social and complex network analysis for quantifying the prominence of nodes. A centrality measure estimates the structural importance of a node based on its location, connectivity, or another structural property. Several such measures have been proposed in the literature. Centrality measures are found to be important for identifying the most influential nodes in a disease network, which depict biomarker genes associated with progression of the disease.

The gene co-expression network constructed in the earlier step is analyzed using different centrality measures. For simplicity, the network is shown for the top 25 genes (by p value) out of the 156 genes identified in the previous steps. Figure 4a shows the initial network of genes and their correlations computed with the Pearson measure. Networks are constructed similarly using the other two correlation measures, Spearman and Kendall [29, 30]. Various approaches are used for the analysis of gene co-expression networks, and topological properties are found to be useful. Among these, centrality measures are important for deciding the importance of a node within the network. In this study, four centrality measures, namely degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality, are applied to the gene co-expression network to determine each node’s (gene’s) significance in the disease network [31]. In Fig. 4b, graph visualization is used to show the nodes with higher degree centrality in red, with node size varying according to the degree centrality value. The other centrality measures are likewise applied independently to each gene co-expression network constructed using the three correlation criteria.
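To make the four centrality measures concrete, the sketch below computes them on a small undirected, unweighted graph using only the Python standard library. It is an illustration, not the R implementation used in the study. The graph `toy_graph` and its edges are hypothetical (they are not the network of Fig. 2 or Fig. 4), betweenness is left unnormalized, and closeness assumes a connected graph.

```python
from collections import deque
from math import sqrt

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

def closeness_centrality(graph):
    """(reachable nodes) / (sum of shortest-path distances), via BFS."""
    result = {}
    for s in graph:
        dist, queue = {s: 0}, deque([s])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total > 0 else 0.0
    return result

def betweenness_centrality(graph):
    """Brandes' algorithm; counts shortest paths passing through each node."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        stack, preds = [], {v: [] for v in graph}
        sigma = {v: 0 for v in graph}  # number of shortest paths from s
        sigma[s] = 1
        dist, queue = {s: 0}, deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints in an undirected graph
    return {v: b / 2 for v, b in bc.items()}

def eigenvector_centrality(graph, iterations=200):
    """Power iteration on (A + I); the shift avoids oscillation on bipartite graphs."""
    score = {v: 1.0 for v in graph}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[w] for w in graph[v]) for v in graph}
        norm = sqrt(sum(x * x for x in new.values()))
        score = {v: x / norm for v, x in new.items()}
    return score

# Hypothetical 6-gene network given as an adjacency list
toy_graph = {
    "G1": ["G2", "G3", "G4"],
    "G2": ["G1", "G3"],
    "G3": ["G1", "G2"],
    "G4": ["G1", "G5"],
    "G5": ["G4", "G6"],
    "G6": ["G5"],
}
degree = degree_centrality(toy_graph)
closeness = closeness_centrality(toy_graph)
betweenness = betweenness_centrality(toy_graph)
eigenvector = eigenvector_centrality(toy_graph)
```

In this toy network G1 has the most neighbors, while G4 matters mainly as a bridge between the two halves of the graph, which illustrates why the measures can rank the same node differently.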

Fig. 4

Gene co-expression network analysis using degree centrality. a Gene co-expression network. b The 10 nodes with the highest degree centrality are shown in red, and node size varies with the degree centrality value

In a similar way, the overall analysis is performed on the networks constructed using the three measures: Pearson, Spearman, and Kendall. Of these, the network constructed using the Pearson correlation measure is found to be the most appropriate for further analysis. All four centrality measures are applied to this network, and the nodes that satisfy all four centrality measures are considered the most significant; they are shown as red nodes in Fig. 5a.
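One way to read "satisfying all four centrality measures" is as a set intersection over the top-ranked nodes of each measure. The sketch below illustrates that reading in standard-library Python. The cutoff `k`, the tie-breaking rule, and every score in `toy_scores` are assumptions made here for illustration; the text does not state the exact criterion the authors used.

```python
def top_k(scores, k):
    """Set of the k highest-scoring genes (ties broken by gene name)."""
    ranked = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return set(ranked[:k])

def consensus_genes(score_tables, k):
    """Genes that are in the top k under every centrality measure."""
    top_sets = [top_k(scores, k) for scores in score_tables.values()]
    return set.intersection(*top_sets)

# Hypothetical centrality scores for five genes
toy_scores = {
    "degree":      {"G1": 0.9, "G2": 0.8, "G3": 0.7, "G4": 0.2, "G5": 0.1},
    "closeness":   {"G1": 0.8, "G2": 0.3, "G3": 0.9, "G4": 0.7, "G5": 0.1},
    "betweenness": {"G1": 0.5, "G2": 0.1, "G3": 0.6, "G4": 0.0, "G5": 0.4},
    "eigenvector": {"G1": 0.7, "G2": 0.6, "G3": 0.9, "G4": 0.1, "G5": 0.2},
}

# Intersection: genes ranked in the top 3 by all four measures
influential = consensus_genes(toy_scores, k=3)

# Union: genes ranked in the top 3 by at least one measure
candidates = set.union(*(top_k(s, 3) for s in toy_scores.values()))

# Difference: genes supported by some measures but not by all of them
partial_support = candidates - influential
```

With these made-up scores, `influential` contains G1 and G3, and the remaining candidates fall into `partial_support`.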

Fig. 5

Gene co-expression network analysis using centrality measures. Nodes shown in red satisfy all four centrality criteria and represent the most influential genes. a Overall analysis. b Weighted analysis

Weighted analysis

Weighting is a statistical technique in which datasets are adjusted through calculations to bring them more in line with the population being studied [32]. It allows researchers to correct issues that occurred during data collection. Because it takes place after the sample has been selected, weighting is also known as post-stratification. It refers to statistical adjustments made to improve the accuracy of survey estimates [33]. In this study, a weighted analysis is performed to verify the accuracy of the overall analysis performed in the earlier step.

To decide the importance of a node in the gene co-expression network, we applied four centrality measures: degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality [34]. Weights are assigned to these centrality measures according to their significance in biological networks for identifying the most influential nodes. Degree centrality is assigned a weight of 0.6, betweenness centrality a weight of 0.3, closeness centrality a weight of 0.2, and eigenvector centrality a weight of 0.1. These weights are assigned arbitrarily, with the aim of making the total weight of a node equal to 1. The weighted network is shown in Fig. 5b.
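A weighted combination of the four centrality scores could be computed along the following lines. This is only a sketch: the text does not describe how the weights are combined with the scores, so the min-max scaling of each measure and the weighted average are assumptions made here, and the scores in `toy_scores` are made up. The weights are the ones stated above; because they add up to 1.2 as listed, the sketch divides by their sum so that the composite score stays between 0 and 1.

```python
# Weights as stated in the text (degree, betweenness, closeness, eigenvector)
WEIGHTS = {"degree": 0.6, "betweenness": 0.3, "closeness": 0.2, "eigenvector": 0.1}

def min_max(scores):
    """Rescale one measure to [0, 1] so measures are comparable (assumption)."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {gene: 0.0 for gene in scores}
    return {gene: (value - low) / (high - low) for gene, value in scores.items()}

def weighted_centrality(score_tables, weights):
    """Weighted average of the scaled centrality scores for every gene."""
    total_weight = sum(weights.values())  # 1.2 for the weights listed above
    scaled = {measure: min_max(score_tables[measure]) for measure in weights}
    genes = next(iter(score_tables.values()))
    return {
        gene: sum(weights[m] * scaled[m][gene] for m in weights) / total_weight
        for gene in genes
    }

# Hypothetical centrality scores for three genes
toy_scores = {
    "degree":      {"G1": 0.9, "G2": 0.5, "G5": 0.1},
    "betweenness": {"G1": 0.6, "G2": 0.0, "G5": 0.0},
    "closeness":   {"G1": 0.8, "G2": 0.6, "G5": 0.4},
    "eigenvector": {"G1": 0.7, "G2": 0.2, "G5": 0.1},
}
composite = weighted_centrality(toy_scores, WEIGHTS)
ranking = sorted(composite, key=composite.get, reverse=True)
```

Genes at the top of `ranking` would be the candidates to highlight in a weighted network; comparing that list with the result of the unweighted analysis is one way to check whether the two analyses agree.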

The overall analysis and the weighted analysis performed on the network constructed using the Pearson correlation criterion are shown in Fig. 5. In the network, nodes highlighted in red satisfy all four centrality measures and are therefore inferred to be the most influential nodes in the disease network.

We used the well-known statistical software R, with packages from https://cran.r-project.org/web/package, for survival analysis, gene co-expression network construction, and network analysis to identify genetic biomarkers associated with GBM.
