Var∣Decrypt: a novel and user-friendly tool to explore and prioritize variants in whole-exome sequencing data

Exome-seq analysis pipeline

To provide an all-in-one solution, we first implemented an Exome-seq variant analysis pipeline (available as supplementary material, see Additional file 1 and Additional file 8: Fig. S2 for details). This pipeline can be used with raw sequencing data (e.g., FASTQ files) to generate variant call format (vcf) files and input files for downstream processing by Var∣Decrypt. For users wishing to use their own vcf files, or vcf files from publicly available repositories, as input, we built in a pre-processing tool called Pre-Var∣Decrypt (see https://gitlab.com/mohammadsalma/vardecrypt), which processes vcf files to generate Var∣Decrypt input files (see below). This step only needs to be performed once for each batch of samples (i.e., patient cohorts); the resulting files can then be stored or directly used in Var∣Decrypt for downstream analyses.

WES data processing using Var∣Decrypt

In order to facilitate WES data analysis and functional interpretation, we developed Var∣Decrypt, an easy-to-use RShiny tool, which can be deployed via Docker on several operating systems (Linux, macOS) or downloaded and installed from its open-source code to run via RStudio. In addition, we provide a link to an online trial version of Var∣Decrypt (https://vardecrypt.com/app/vardecrypt) with access to a test dataset, allowing users to quickly assess the tool and evaluate its capabilities. Var∣Decrypt includes several R packages to perform different post-VCF downstream analyses, which usually require scripting skills for tasks such as installing packages, preparing the input data and calling the appropriate functions. A detailed tutorial on how to use Var∣Decrypt is available at https://gitlab.com/mohammadsalma/vardecrypt/-/wikis/Var%7CDecrypt and as a video tutorial on the front page of the Var∣Decrypt online version. Var∣Decrypt imports the output results from the Exome-seq pipeline or vcf files processed through Pre-Var∣Decrypt (Additional file 1) and provides many built-in enrichment analysis options, helping researchers to develop or confirm hypotheses, to easily explore the differences between normal and tumor samples, and to prioritize variants, genes and pathways for functional analyses. Var∣Decrypt is a fast-operating tool which provides multiple outputs within short time frames (i.e., seconds to minutes for loading and processing a full dataset, Additional file 6: Table S5). The output results and variables are saved in an Rdata file, which lets users explore Var∣Decrypt results later instead of re-running the analysis. Var∣Decrypt allows users to explore, filter and sort genes containing variants, or to search for a specific gene through dynamic interfaces (see below).

Overall presentation of Var∣Decrypt

The Var∣Decrypt interface is composed of several tabs allowing users to get a general overview of the Exome-seq data, to browse the mutated gene lists, or to focus on single genes or single variant types (e.g., stop gain and frameshift deletions).

The ‘Somatic variants explorer’ tab provides a gene list and summary table containing all detected mutated genes (Additional file 8: Fig. S2). We define somatic variants as those specifically acquired in the tumor sample as compared to the control cells; variants or mutations present in the control cells are considered germline variants, as they are not somatically acquired. Instead of focusing on the variants themselves, this dynamic table is gene-centered: for each gene, it indicates the total number of variants detected in the cohort, the types of variants identified (e.g., stop gain, frameshift deletions) and the percentage of patients bearing a mutation in that gene. The right part of the table shows, for each gene, which patient samples contain the indicated variants (Additional file 2: Table S1). Using the ‘mutation rate’ column, users can sort the entire mutated gene list by mutation frequency (i.e., number of patients showing a mutation or variant within a given gene), which provides an overview of the top mutated genes. All types of variants are shown by default, but users may choose to highlight only a subcategory of variants, such as stop gain or frameshift variants (deletions, insertions). Whereas the germline variants from a patient are usually used to filter out nonspecific variants in cancer samples, Var∣Decrypt also allows working on the germline variants (‘Germline variant explorer’), which is useful for the study of Mendelian genetic disorders or family case studies (not shown here).
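The gene-centered ‘mutation rate’ described above can be illustrated with a minimal Python sketch. It is not the Var∣Decrypt implementation (which is written in R); the record layout, the function name `gene_mutation_rates` and the toy patient identifiers are assumptions made for illustration only.

```python
from collections import defaultdict


def gene_mutation_rates(variants, n_patients):
    """Return {gene: percentage of patients with at least one variant in that gene}.

    `variants` is an iterable of (patient_id, gene, variant_type) tuples
    (a hypothetical layout, not the Var|Decrypt input format).
    """
    patients_by_gene = defaultdict(set)
    for patient, gene, _variant_type in variants:
        # A patient counts once per gene, however many variants it carries.
        patients_by_gene[gene].add(patient)
    return {
        gene: 100.0 * len(patients) / n_patients
        for gene, patients in patients_by_gene.items()
    }


# Toy records (hypothetical, not taken from the AEL cohort).
variants = [
    ("P1", "TET2", "stopgain"),
    ("P1", "TET2", "frameshift deletion"),
    ("P2", "TET2", "nonsynonymous SNV"),
    ("P3", "TP53", "nonsynonymous SNV"),
]
rates = gene_mutation_rates(variants, n_patients=4)

# Sorting by rate mimics sorting the table on the 'mutation rate' column.
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

Counting patients rather than variants keeps a gene with many calls in a single sample from dominating the ranking.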

The ‘General statistics’ tab provides information on the frequency of variant types within the cohort, using a color code for the different types (e.g., frameshift, nonsense, missense), and on the class of SNV (e.g., C > T, T > G), which may be useful to check whether a particular bias is present in the samples or in the disease under study (Fig. 2). This tab also provides the total number of variants per sample, a feature that helps to quickly spot any outlier within the datasets. As exemplified in Fig. 2, sample m_13_D from our cohort contains ~ 30-fold more variants than the average of the other samples, likely arising from technical issues during the sequencing or sample handling procedure. Such problematic samples can therefore be quickly spotted and excluded from further analyses. Finally, the top 20 mutated genes are shown with the same color code as for the variant types, to give an overview of the recurrently mutated genes (Fig. 2).
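In Var∣Decrypt, outlier samples are spotted visually on the per-sample variant plot. The same idea could be automated roughly as in the following Python sketch, which flags samples whose variant count exceeds a multiple of the cohort median. The function name, the sample names, the counts and the 10-fold cutoff are all hypothetical choices for illustration; they are not part of the tool.

```python
from statistics import median


def flag_outlier_samples(variant_counts, fold=10.0):
    """Return sample names whose variant count exceeds `fold` times the cohort median.

    `variant_counts` maps sample name -> total number of variants.
    The `fold` cutoff is an arbitrary illustrative value.
    """
    cohort_median = median(variant_counts.values())
    return sorted(
        sample
        for sample, count in variant_counts.items()
        if cohort_median > 0 and count > fold * cohort_median
    )


# Toy counts (hypothetical): one sample carries far more variants than the rest.
counts = {"s1": 100, "s2": 120, "s3": 90, "s4": 110, "s5": 3200}
outliers = flag_outlier_samples(counts)
```

The median is used as the reference because, unlike the mean, it is barely affected by the outlier itself.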

Fig. 2

General overview of the WES datasets. The general features of ten erythroleukemic samples are displayed, showing the variant classification (color-coded as a function of the type of mutation, top left), variant type (single nucleotide polymorphisms (SNP), insertions (INS) and deletions (DEL), top middle), and single nucleotide variant (SNV) class (top right). The bottom panel displays the number of variants per sample (each column represents a unique patient), using the same color code as in the variant classification panel. Note that patient m_13_D is spotted as being an outlier with ~ 30-fold more variants than in the other patients. The dashed red line represents the median number of variants in the cohort. The middle panel shows the variant classification summary in the cohort, using the same mutation-specific color code. Finally, the bottom right panel shows the top 20 mutated genes in the patient cohort (the number of variants/mutations is shown on the horizontal axis), with the percentage of patients bearing a mutation in a given gene indicated

Identifying the recurrently mutated gene fraction within a patient cohort

Var∣Decrypt offers the opportunity to quickly and easily browse WES data in order to identify recurrently mutated genes. By navigating the somatic menus, users can access in one click the gene mutation frequencies (i.e., gene mutation percentage within the cohort), an important feature that helps point to key genes likely involved in the disease phenotype. One key step in the discovery of cancer drivers is to pinpoint the recurrently mutated genes within patient cohorts, as these likely represent true oncogenic drivers or genes important to sustain the cells’ transformed state. However, despite all the filtering steps applied in various Exome-seq analysis pipelines, a very large number of variants usually remains, especially in cancer samples. This represents one of the main challenges in prioritizing gene mutations when dealing with Exome-seq datasets.

Filtering of putative false-positive gene mutations

A common issue with Exome-seq data from short-read sequencing platforms (such as Illumina sequencing) is the large fraction of variants called in genes harboring repetitive sequences, such as variable number of tandem repeats (VNTRs). The MUC gene family [27] is a good example of such a problematic alignment and variant calling situation, as its members contain long polymorphic stretches of ~ 60 bp VNTR repeats, which are difficult for current aligners and variant callers to handle. Although some true causative variants may indeed be present within the VNTRs of the MUC gene family [28], we created an empirical filtering option allowing users to define a threshold for the maximum number of variants allowed per gene, in order to ‘clean up’ the mutated gene list and remove the error-prone VNTR-containing gene sequences in the patient cohort. As a result, by setting a threshold of a maximum of 4 variants per gene in a maximum of 20% of the patients, we could remove the apparently highly variable and likely false-positive mutated genes from the final list (Fig. 3).
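The threshold-based filter can be sketched in Python as follows. The exact rule applied by Var∣Decrypt is not spelled out here, so the sketch assumes one possible reading: a gene is discarded when it carries more than 4 variants in a single patient in more than 20% of the patients. The function name, parameter names and toy gene labels are hypothetical.

```python
from collections import Counter


def filter_hypervariable_genes(variants, n_patients,
                               max_variants=4, max_patient_fraction=0.20):
    """Keep genes that are not 'overloaded' in too many patients.

    `variants` is an iterable of (patient_id, gene) pairs, one per variant call.
    Assumed rule (one reading of the threshold): drop a gene if it has more
    than `max_variants` variants in a single patient in more than
    `max_patient_fraction` of the cohort.
    """
    calls_per_patient_gene = Counter(variants)
    overloaded_patients = Counter()
    genes = set()
    for (_patient, gene), n_calls in calls_per_patient_gene.items():
        genes.add(gene)
        if n_calls > max_variants:
            overloaded_patients[gene] += 1
    return sorted(
        gene for gene in genes
        if overloaded_patients[gene] / n_patients <= max_patient_fraction
    )


# Toy cohort of 5 patients (hypothetical gene names).
variants = (
    [(p, "GENE_VNTR") for p in ("P1", "P2", "P3") for _ in range(6)]  # 6 calls in 3/5 patients
    + [("P1", "GENE_A"), ("P2", "GENE_A")]                           # 1 call in 2/5 patients
    + [("P1", "GENE_B")] * 5                                         # 5 calls in 1/5 patients
)
kept = filter_hypervariable_genes(variants, n_patients=5)
```

Under this reading, a recurrently mutated gene with one or two variants per patient is retained regardless of how many patients carry it, whereas a gene with many calls per sample across most of the cohort is removed.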

Fig. 3

Custom filtering of putative false-positive mutations. The top 20 mutated genes are shown before (A) and after (B) applying the custom filters. This shows that without this filtering step, a number of genes score positive in 100% of the patients, including gene families containing variable number of tandem repeats (e.g., the MUC gene family). After applying a threshold (maximum of 4 variants per gene in a single patient, in a maximum of 20% of the patients) and selecting the option to retain genes present in the COSMIC database, the resulting mutated gene list is highly enriched in known oncogenic drivers and previously reported AEL-mutated genes

Another commonly used strategy to enrich for putative causative variants is to filter the mutated gene lists against cancer gene databases such as COSMIC, OncoKB or NCG [29,30,31]. We also implemented a filtering option that restricts the list to mutated genes tagged as cancer-associated in such databases. The resulting outputs are therefore highly enriched in putative oncogenic drivers, which facilitates exploring the mutational landscape of human cancers. As confirmation, applying this filtering strategy to our AEL WES data produced a mutated gene list enriched for previously reported AEL-associated gene mutations [30, 31] such as the epigenetic modifiers TET2, NCOR1, NCOR2, BCOR, BCORL1, the CBP(CREBBP)/p300(EP300) co-activators, the polycomb repressive complex proteins EZH2, ASXL1, ASXL2, and the cohesin complex component RAD21 (Additional file 2: Table S1).
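Conceptually, this database filter amounts to intersecting the mutated gene list with a reference set of cancer-associated gene symbols. The Python sketch below illustrates the idea; the function name is hypothetical, and the small reference set is a toy stand-in rather than an actual export from COSMIC, OncoKB or NCG.

```python
def keep_cancer_genes(mutated_genes, cancer_genes):
    """Return the mutated genes found in the reference set, preserving input order.

    Symbols are compared case-insensitively, since gene lists from different
    sources do not always use the same capitalization.
    """
    reference = {symbol.upper() for symbol in cancer_genes}
    return [gene for gene in mutated_genes if gene.upper() in reference]


# Toy inputs (hypothetical): the reference set stands in for a database export.
mutated = ["MUC4", "TET2", "tp53", "GENE_X"]
reference_set = {"TET2", "TP53", "EZH2"}
filtered = keep_cancer_genes(mutated, reference_set)
```

Preserving the input order means a list already sorted by mutation rate stays sorted after filtering.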

Integration of enrichment tools

An important aspect of Var∣Decrypt is the access to various types of enrichment analyses thanks to the implementation of dynamic, customizable graphical outputs. Var∣Decrypt contains disease ontology, gene ontology (e.g., biological process, molecular function, and cellular component), and Reactome/KEGG pathway enrichment tabs, offering the opportunity to identify particular pathways or functional alterations in the samples. The ‘enrichment’ tab lets users quickly identify enrichments of disease ontology terms, biological pathways (Reactome, KEGG and WikiPathways), or Gene Ontology (GO) terms such as ‘Biological Process’, ‘Molecular Function’, or ‘Cellular Component’ linked to the mutated gene lists. In addition, searches for gene–disease associations or gene–cancer associations are also available to highlight many known associations with established human disorders and cancers. For each category, users can choose between three different graphical outputs including bar graphs, association or enrichment factor along with color-coded p-value representations (Fig. 4A–C). These outputs are dynamic and customizable, as users can switch from one representation to another or increase/decrease the number of categories to display in one click. Var∣Decrypt also provides a somatic interaction view in order to identify which gene mutations tend to co-occur or are mutually exclusive. In the example shown in Fig. 4D, BCOR and XPC mutations seem to be mutually exclusive, suggesting that inhibiting BCOR activity in XPC-mutated AEL cells (and vice versa) may be therapeutically beneficial.
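The statistical test behind the somatic interaction view is not stated here. A common way to assess co-occurrence or mutual exclusivity between two genes is a pairwise Fisher's exact test on the 2 × 2 table of mutated/non-mutated patients, and the Python sketch below assumes that approach. The function names, gene labels and patient sets are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (odds_ratio, p_value). The p-value sums the probabilities of all
    tables with the same margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x patients in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        p for p in (prob(x) for x in range(low, high + 1))
        if p <= p_observed * (1 + 1e-9)  # tolerance for floating-point ties
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(1.0, p_value)


def somatic_interaction(mutated_a, mutated_b, all_patients):
    """Classify the relation between two genes from their sets of mutated patients."""
    both = len(mutated_a & mutated_b)
    only_a = len(mutated_a - mutated_b)
    only_b = len(mutated_b - mutated_a)
    neither = len(all_patients) - both - only_a - only_b
    odds_ratio, p_value = fisher_exact_two_sided(both, only_a, only_b, neither)
    if odds_ratio > 1:
        label = "co-occurrence"
    elif odds_ratio < 1:
        label = "mutual exclusivity"
    else:
        label = "no association"
    return label, p_value


# Toy cohort of 10 patients (hypothetical): the two genes are never mutated together.
patients = {f"P{i}" for i in range(1, 11)}
gene_x = {"P1", "P2", "P3", "P4", "P5"}
gene_y = {"P6", "P7", "P8", "P9", "P10"}
label, p_value = somatic_interaction(gene_x, gene_y, patients)
```

With only ten patients, as in a small cohort, such a test has little power, so an apparent mutual exclusivity should be read as a lead rather than a conclusion.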

Fig. 4

Disease and pathway enrichment features. Var∣Decrypt can depict various enrichment plots using enrichment factor (A), qValue bar plot (B) or cluster tree (C) visualization for disease ontology, biological pathways (Reactome, WikiPathways, KEGG), various gene ontology (GO) categories (biological process, molecular function, cellular component), gene–disease and gene–cancer associations. D Matrix showing the mutually exclusive (brown) or cooperating mutations (green) in the AEL patient cohort. Dashed red lines highlight the mutually exclusive BCOR and XPC mutations

Another useful built-in feature is the enrichment of mutations in genes belonging to known oncogenic signaling pathways. This feature provides a graph representation of the enriched mutated pathways together with the number of patients bearing mutations in the related pathways (Fig. 5A). By simply clicking on a given pathway (right part), users can display a detailed list of the genes contained in the chosen pathway and check which genes and which patient samples harbor the mutation(s). This representation is useful to identify the recurrently mutated genes within a single oncogenic pathway and to check which signaling component is frequently altered in the disease. The example depicted in Fig. 5B shows that the receptor tyrosine kinase/RAS pathway and the Notch and TGF-β pathways are frequently altered in AEL patients, and that the ABL proto-oncogene 1 (ABL1), NOTCH2, and TGF-β receptor 2 (TGFBR2) genes are among the top mutated genes.
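The two quantities summarized per pathway (the number of mutated genes relative to the pathway size, and the fraction of patients bearing at least one mutation in the pathway) can be computed as in the following Python sketch. The function name, data layout, pathway names and gene labels are hypothetical and serve only to illustrate the calculation.

```python
def summarize_pathways(pathways, patients_by_gene, n_patients):
    """Summarize mutations per pathway.

    `pathways` maps pathway name -> set of member genes.
    `patients_by_gene` maps gene -> set of patients mutated in that gene.
    """
    summary = {}
    for name, genes in pathways.items():
        # Genes of the pathway with at least one mutated patient.
        mutated = sorted(g for g in genes if patients_by_gene.get(g))
        # Patients carrying a mutation in any gene of the pathway.
        affected = set().union(*(patients_by_gene[g] for g in mutated))
        summary[name] = {
            "mutated_genes": mutated,
            "gene_fraction": len(mutated) / len(genes),
            "patient_fraction": len(affected) / n_patients,
        }
    return summary


# Toy inputs (hypothetical pathway and gene names), cohort of 4 patients.
pathways = {
    "PATHWAY_1": {"G1", "G2", "G3"},
    "PATHWAY_2": {"G4", "G5"},
}
patients_by_gene = {
    "G1": {"P1", "P2"},
    "G2": {"P2", "P3"},
    "G4": set(),
}
summary = summarize_pathways(pathways, patients_by_gene, n_patients=4)
```

A patient mutated in several genes of the same pathway is counted once, which is why the patient fraction is computed from the union of the per-gene patient sets.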

Fig. 5

Overview of oncogenic pathway alterations in the patient cohort. The oncogenic pathways affected in the AEL patients are shown (A) with the number of affected genes in relation to the total number of genes linked to each pathway. The second plot displays the fraction of patients bearing mutations in a given pathway. B For each pathway, Var∣Decrypt allows visualizing the mutated genes for each patient to easily spot the recurrently mutated genes. Oncogenes and tumor suppressor genes are indicated in blue and red, respectively

Visualization of mutational hotspots and amino acid changes

Finally, Var∣Decrypt provides a visualization tool depicting the localization of mutations on a given gene product (protein). The known protein domains are displayed along with the position of the various mutations or variants detected, with a color code indicating the variant types (STOP gain, frameshifts, non-synonymous SNPs). This feature makes it possible to detect mutational hotspots and preferential localization of mutations in functional protein domains, as shown in Fig. 6A for the succinate dehydrogenase complex flavoprotein subunit A (SDHA) gene. In addition, a table provides the identity of amino acid changes along with several variant metrics (Additional file 3: Table S2).

Fig. 6

Visualizing mutational hotspots. A The ‘amino-acid changes’ page displays the protein domains with the localization and type of mutations in the entire cohort. The example of the SDHA gene is shown. B Structure (accession #6VAX) of the SDHA active site from [44]. Only amino acids 446 to 472 are shown. The key active site residue R451 is indicated in black, the positions of the mutated residues in AEL are shown in red (left). Right, similar representation using the ‘schematic’ view, with amino acid side chains shown as sticks and balls using MMDB viewer [55]

Discovery of putative novel oncogenic mutations in AEL

We applied Var∣Decrypt to decipher the mutational landscape of AEL. Besides the known and recently described mutations in TET2 (40% of patients in our cohort), TP53 (30% of patients), EZH2 (10%), NCOR1/2 (50%/20%) or GATA1 (10% of patients), our tool highlighted mutational hotspots in several additional genes, likely representing important components of the AEL mutational landscape. We identified several mutations within the SDHA gene (Fig. 6A), a critical member of the succinate dehydrogenase complex, which were not primarily identified in the previous AEL studies [30,31,32,33,34]. The succinate dehydrogenase (SDH) complex is a mitochondria-localized multiprotein complex involved in cellular respiration through the electron transfer chain (complex II) [35, 36]. The SDH complex is encoded in the nucleus and composed of 4 subunits (SDHA-D). Loss of function of any of the SDH subunits may be associated with neuroendocrine tumors or neurodegenerative disorders such as Leigh’s disease [36]. We identified a mutational hotspot within the SDHA gene in 70% of patients (Fig. 6A). Interestingly, all detected missense mutations (V446A, A449V, A454T, S456L, R465Q, A466T, C467S) cluster around the key SDHA active site residue R451 [35] (Fig. 6B). Although the precise functional impact of such mutations is currently unknown, some or all may significantly alter the spatial conformation of the SDHA active site and lead to (partial) insufficiency. Measuring complex II activity in AEL cells and its requirement for leukemia development is beyond the scope of this study but represents an interesting lead to follow, as complex II alterations may be of importance for the development or maintenance of AEL.

Finally, on a more global scale, Var∣Decrypt allowed us to identify enrichment of mutations in oncogenic signaling pathways, in particular the receptor tyrosine kinase (RTK)/RAS pathway, with a high prevalence of IRS1 mutations (40% of patients, including in-frame deletions, non-synonymous SNVs and a STOP gain), and mutations in FGFR1, RET, JAK2 (30%, 20% and 10% of patients, respectively), or BRAF (10%). Importantly, our tool highlighted the Notch pathway as frequently mutated in AEL, with NOTCH1 and NOTCH2 receptor variants found in 40% and 60% of the patients, respectively.

Taken together, these data highlight putative novel oncogenic processes and pathways in AEL, and underscore the usefulness of Var∣Decrypt to provide leads for functional explorations.

Validation of Var∣Decrypt using an independent dataset of 90 multiple myeloma samples

We sought to validate Var∣Decrypt using an independent dataset. To this aim, we analyzed published WES data from 30 human multiple myeloma cell lines (HMCLs) and primary multiple myeloma (MM) from 59 patients [29]. Previous analysis of these data revealed a prevalent TP53 mutational landscape and altered MAPK pathways. Reanalyzing this dataset with Var∣Decrypt, after filtering out the putative false-positive hits (i.e., highly mutated gene families such as MUC genes, see above) using the filtering options (frequency less than 4 mutations per gene within 20% of the cohort), and after crossing the mutated gene list with cancer gene databases (COSMIC, as in [29]), identified very similar mutated hits in MM, with frequent TP53 (47%), KRAS (40%), NRAS (30%), ATM (33%) alterations, and many epigenetic modifiers (BRD3, BRD4, SETD1B) and DNA repair proteins (FANCD2, RECQL4) (Additional file 4: Table S3). In particular, we identified the MAPK/RAS pathway as recurrently altered (Additional file 9: Fig. S3 and Additional file 10: Fig. S4) [37, 38], validating the functionality of Var∣Decrypt.

Performance

Var∣Decrypt inputs are generated by one of the two provided pipelines (Additional file 7: Fig. S1). These pipelines can be used locally or on a cluster. As the trimming and alignment steps are resource-consuming, we highly recommend running the pipeline that handles FASTQ data on a cluster. To test the compatibility of our annotation pipeline with a wide range of aligners and variant calling tools, we ran it on publicly available VCF files of different sizes generated by different variant calling tools [2], using two different methods of deployment (Additional file 5: Table S4). As Var∣Decrypt is an RShiny application, it should run without issues on the majority of web browsers. We measured the performance of Var∣Decrypt using public data of primary multiple myeloma (MM) from 59 patients [29]. Processing time and memory usage were evaluated during (1) the processing of new data and (2) the reloading of already processed data (Additional file 6: Table S5, Additional file 11: Fig. S5). These measurements indicate that Var∣Decrypt is a fast-operating tool, with loading and processing times ranging from seconds to minutes (< 3 min 5 s for the larger datasets on a regular laptop with 8 GB of RAM, < 1 min 30 s with 16 GB of RAM).

Comparison with other available tools

Other available tools provide some of Var∣Decrypt’s functionalities, but they (i) are only available online, (ii) require bioinformatics expertise to prepare data and export the results in a human-readable format, (iii) handle only one type of variant (i.e., somatic or germline), or (iv) do not support visualization of variant and enrichment results (Table 1).

Table 1 Comparison of Var∣Decrypt with other available tools
