Integrative multi-omic sequencing reveals the MMTV-Myc mouse model mimics human breast cancer heterogeneity

Genomic analyses reveal conserved copy number gains in microacinar tumors

Based on the diverse transcriptional and histological subtypes observed in the MMTV-Myc tumors [13], we hypothesized these phenotypes were due to a divergence in genomic changes conserved between each histological subtype. To ascertain putative conserved genomic changes within each histological subtype, short read WGS was performed on randomly selected tumors of microacinar, squamous, and EMT histological subtypes, later integrated with gene expression data obtained from matched samples as well as additional tumors from each subtype (Fig. 1). A great deal of genomic heterogeneity was observed between the histological subtypes as shown by representative Circos plots for the microacinar (Fig. 2A), squamous (Fig. 2B), and EMT (Fig. 2C) tumors. Few inversions and translocations were called across all tumors, consistent with rates of large structural rearrangements in human breast cancer [50]. Most differences in genetic aberrations between subtypes were confined to single nucleotide variants (SNVs) and copy number variants (CNVs).

Fig. 1figure 1

Schematic of workflow. Significant histological and transcriptional heterogeneity was found in the MMTV-Myc mouse model of human breast cancer. To ascertain the genetic origins of transcriptional differences and integrate these omics data between histological subtypes, we performed short read whole genome sequencing at a depth of 40 × for three tumors of each unique and predominant histological subtype: microacinar, squamous, and EMT like tumors. Somatic single nucleotide variants, copy number alterations, inversions, and translocations were profiled after alignment and processing of genomic data. Genomic somatic variants were integrated with and compared to previously obtained Affymetrix microarray gene expression and pathway analysis by single sample gene set enrichment analysis (ssGSEA). Integration of multi-omic data enables the identification of thematic pathways driving tumorigenesis and putative oncogenic drivers for each histological subtype. Importantly, these thematic pathways identified in the MMTV-Myc mouse model share similarities in human breast cancer, impactful analogous genetic events in humans co-occur with Myc amplification, and these pathways affect human breast cancer patient overall survival significantly (Created with BioRender.com)

Fig. 2figure 2

Heterogeneous and conserved somatic features revealed by whole genome sequencing. Representative Circos plots are shown for (A) microacinar, (B) squamous, and (C) EMT histological subtypes. The outermost ring of each Circos plot depicts an ideogram for the mouse chromosomes proportionate with actual chromosome length. The next inner ring shows mutations in genes as stacked blocks at their corresponding genomic locations, color coded to their predicted impact by SnpEff [40]—yellow for low impact, orange for moderate impact, and red for high impact. The next inner ring shows discrete copy number changes as analyzed by CNVKIT; red regions indicate amplification and blue regions indicate deletions. The height of each copy number alteration corresponds to the predicted change in copy number, with the lowest level change being \(\pm \hspace1\) and showing a max copy number change of \(\pm\) 2. The innermost ring reveals inversions and translocations as determined by the consensus of Delly and Lumpy somatic variant callers. Inversions are colored black, while translocations match the color of the ideogram of one of the two chromosomes involved in the translocation event. (D) A CNVKIT heatmap shows the log2 fold change of the estimated normalized copy number segments of each chromosome for each tumor sample relative to the wildtype reference

Strikingly, all microacinar tumors sequenced shared the same whole chromosome amplification events on chromosomes 15 and 11 as revealed by copy number segmentation calls (Fig. 2D). Estimated total ploidy gain for chromosome 15 is 2, for a total of 4 gene copies, and ploidy gain of 1 for chromosome 11, for a total of 3 gene copies. The predicted integer copy number gains were consistent across all microacinar tumors. While many whole chromosome amplifications and deletions were observed in EMT tumors, none were consistent across the histological subtype (Additional file 26: Table S1). Interestingly, there were very few focal and whole chromosome CNVs in individual squamous tumors and none shared across the subtype. The majority of CNVs across all tumors sequenced were broad and affected the entire chromosome rather than a discrete region within the chromosome, despite notable exceptions discussed in later figures. This is consistent with human breast cancer CNVs, which often affect a whole chromosome arm on either side of the centromere [51] except in the cases of strong selective pressure over a particular region, such as the cases of ERBB2 in breast cancer [52] or AR in prostate cancer [53].

Copy number alterations are highly correlated with gene expression

It is well established that CNVs typically correlate strongly with changes in gene expression across cancer types [54] and have been used to target cancer dependencies for therapeutic intervention [55]. Despite this, we sought to establish correlation between gene expression and CNVs in the MMTV-Myc mouse model tumors empirically. When correlating the estimated absolute integer copy number gain or loss for each gene with their matched gene expression sample and extending these results to 33 additional samples profiled by microarray, we obtained high Pearson’s r and Kendall’s τ correlation coefficients broadly across the genome for CNV sites (Fig. 3A). Of note, the highest and most consistent correlation for both Pearson’s r and Kendall’s τ occurs on chromosomes 15 and 11, suggesting the highly conserved ploidy gains across microacinar samples translates well into increased gene expression. Correlations between copy number and gene expression for EMT (Additional file 5: Fig. S1), squamous (Additional file 6: Fig. S2), and microacinar (Additional file 7: Fig. S3) specifically limited to each histological subtype are also available.

Fig. 3figure 3

Copy number changes drive gene expression changes. (A) Circos plot of correlation of copy number changes and gene expression across all tumor samples. Significant Kendall’s rank correlation coefficient (> = 0.3) shown in blue and significant Pearson correlation coefficient (> = 0.7) shown in red. (B) Unsupervised hierarchical clustering of the MSigDB C1 positional dataset for ssGSEA values closely recapitulates the stratification of histological subtypes by gene expression clustering. (C) Principal component analysis (PCA) of C1 positional ssGSEA values reveals distinct clustering by histological subtype

After establishing the link between CNVs and gene expression in the histological subtypes, we wanted to examine whether gene expression differences localized to human defined cytogenetic bands could stratify the histological subtypes. Indeed, when performing ssGSEA on gene expression data using the MSigDB C1 positional gene set, we find that unsupervised hierarchical clustering of these data post normalization largely clusters histological subtypes separately (Fig. 3B), recapitulating clustering from raw gene expression data [13]. These data suggest CNVs or other events confined to specific genomic regions are largely responsible for gene expression differences between histological subtypes. Principal component analysis (PCA) on these same C1 ssGSEA values reveals a similar clustering pattern among subtypes (Fig. 3C). Principal component 1 was able to explain over 25% of the variance in this population, which largely separated the EMT and microacinar subtypes, with squamous tumors infiltrating both clusters or being separated mostly by principal component 2.

Differential mutational landscapes between mouse models of breast cancer

Having established CNVs as being a critical factor in determining gene expression in MMTV-Myc tumors, we sought to investigate whether differing mutations within these tumors played a role in gene expression changes. After limiting gene expression data to genes predicted to have moderate or high impact mutations as determined by SnpEff, we did not see a statistically significant difference in the proportion of genes that were differentially regulated compared to the overall proportion of all genes that are differentially regulated between subtypes (data not shown). This is unsurprising as mutations typically do not affect self-gene expression, except in the case of truncating mutations [56].

Next we examined wholistic mutation patterns between each histological subtype and compared them to mutation patterns in previously sequenced MMTV-Neu and MMTV-PyMT tumors [27]. We found no large differences in overall mutational burden between MMTV-Myc subtypes, although we found trends between mouse models (Fig. 4A). Investigating these trends, we find mutational burden did not diverge significantly between MMTV-Myc subtypes but varied considerably in the MMTV-Neu mouse model. The MMTV-PyMT model exhibited the highest overall mutational burden of mouse models analyzed, but also demonstrated considerable variability between individual tumors.

Fig. 4figure 4

Oncogenic drivers determine mutational heterogeneity. (A) Total counts of overall somatic mutational burden of MMTV-Myc tumors compared to mutational burden of MMTV-Neu and MMTV-PyMT tumors as shown by bar plot. (B) Weights of Catalogue of Somatic Mutations in Cancer (COSMIC) mutational signatures derived from each tumor using DeconstructSigs [48] depicted by stacked bar plot. (C) Venn diagram of conserved mutations (≥ 66% of tumors) between histological subtypes of moderate or high impact predicted by SnpEff. Putatively impactful oncogenes with Sanger sequencing confirmed mutations are listed by their representative histological subtype in which those mutations are conserved

We hypothesized these differences in mutational burden could arise from separate oncogenic drivers and mutational processes within each mouse model given the extensive characterization of mutational processes in human cancers [57, 58]. To examine mutational processes in our mouse models, we utilized deconstructSigs [48] to generate different COSMIC single base substitution (SBS) signatures that take adjacent nucleotides into context and ascribes etiologies to different trinucleotide mutational patterns. We found that all MMTV-Myc tumors regardless of subtype were predominated by the homologous recombination deficient (HRR) signature (Fig. 4B). The HRR signature is strongly associated with germline and somatic mutations in BRCA1 and BRCA2 mutations in human breast cancer [57]. No mutations in BRCA1 or BRCA2 were found in any of the MMTV-Myc tumors analyzed, with these signatures possibly the result of BRCA1/BRCA2 promoter hypermethylation or other factors not analyzed. MMTV-Neu tumors were predominated by the clock-like aging signatures, while MMTV-PyMT tumors demonstrated large tobacco smoking signatures. Given that all mouse models were raised in similar controlled environments, these data suggest an unidentified endogenous C > A mutational mechanism present in MMTV-PyMT tumors.

While overall mutational burden and mutational signatures cannot parse between histological subtypes of the MMTV-Myc tumors, we reasoned that specific mutations may be associated with each subtype. Indeed, we find a small number of conserved (≥ 66% of tumors in each subtype) and impactful mutations within each subtype (Fig. 4C). Notable conserved mutations in the EMT subtype include KRAS G12D activating mutations and splice variants in SCRIB. This may be significant as others have found SCRIB cooperates with MYC for transformation and mislocalization of SCRIB within the cell, sufficient to promote cell transformation [59]. Interestingly, the squamous subtype had no discernible conserved and impactful mutations between tumors. There may be heterogeneous conservation of signaling pathways activated at the transcriptional level, though, as each squamous tumor had impactful mutations in transcription factors: zinc-finger and BTB domain containing (ZBTB) genes, zinc-finger protein (ZFP) genes, or both. For the microacinar tumors, we observed conserved missense mutations at A538E in proto-oncogene c-KIT (KIT) and at A255D for retinoic acid receptor-α (RARA). KIT is a well-established oncogene, particularly in acute myeloid leukemia (AML) [60] where mutations play a large role in pathology, and RARA is involved in embryonic development whose disruption is well established in carcinogenesis [61].

Interestingly, KIT and RARA mutations were found to be mutually inclusive in the tumors sequenced. Of the 9 tumors sequenced at the genome level, 5 were found to have both the A538E KIT and A255D RARA mutations, while the other 4 tumors contained no discernible mutations whatsoever in either KIT or RARA. To validate our WGS findings and evaluate the extent of these mutations further in other MMTV-Myc tumors, we performed Sanger sequencing on the tumors that previously underwent WGS and an additional two tumors from each histological subtype. From these data, we confirmed that these KIT and RARA mutations were mutually inclusive in a larger population of 7 of the 15 tumors that underwent Sanger sequencing. From the total populations analyzed, these mutations were present in 60% of both the microacinar and squamous subtypes, while only 20% in EMT (Additional Files 1 and 2). These data may suggest a link between KIT and RARA given their co-occurrence patterns; however, the exact functional implications of these mutations are not understood.

To gain a better understanding of the potential functional impact of the co-occurring KIT and RARA mutations, we performed multiple sequence alignment on KIT and RARA amino acid sequences from their most prevalent isoform using Clustal Omega [62]. Comparing between species, A538 KIT in the mouse maps to M535 in humans with an overall interspecies amino acid identity of 82.77% for the full protein (Additional file 3). Both mutations reside in exon 10 of their respective species, where the majority of this exon codes for the transmembrane domain in topological space. According to TCGA PanCancer Atlas data shown in cBioPortal, the KIT transmembrane domain in humans spans amino acids 525–545, suggesting that the analogous A538E mutation in the mouse occurs directly in the middle of the transmembrane domain. Additionally, while TCGA PanCancer data do not show confirmed pathogenic mutations in the transmembrane domain of KIT, the transmembrane domain resides between two confirmed mutational hotspots labeled as likely oncogenic (Additional file 8: Fig. S4). We speculate that the A538E mutation will destabilize the transmembrane domain of KIT leading to dysregulated KIT signaling. Follow up functional studies will need to be performed to ascertain the causal effects of this mutation in mice and in humans.

Similarly, A255 RARA in the mouse maps directly to A255 RARA in humans. Importantly, exon 6, where the mutations reside in both species, shares 100% identity, while the whole protein sequences share 89.98% identity (Additional file 4). Exon 6 in both species comprises the beginning portion of the hormone receptor ligand binding region of RARA. The TCGA PanCancer cohort does not contain A255D RARA mutations and most mutations are not confirmed whether they are pathogenic or not (Additional file 9: Supplementary Fig. S5). We speculate that the A255D mutation interferes with retinoic acid binding and heterodimerization with the retinoid X receptor, which thus inhibits the transcriptional activation of downstream genes that would lead to cell differentiation and cell cycle control [63, 64].

Heterogeneous activation of KRAS signaling in the EMT subtype

Previous studies on the MMTV-Myc model have shown that the EMT subtype often possessed activating KRAS mutations and increased RAS signaling [13], whereas the squamous and microacinar subtypes largely did not (Fig. 5A). This was the case for EMT tumors 222 and 1938 that underwent WGS, where both had KRAS G12D missense mutations, while the squamous tumor 1066 acquired the less transforming KRAS G13R mutation (Fig. 5B). Consequently, when comparing ssGSEA values for the C6 oncogenic signature gene sets between the three histological subtypes, EMT tumors consistently had upregulation of various KRAS signaling pathways across tissue types (Fig. 5C). Investigation of a representative KRAS signaling pathway shows the probability of KRAS activation is considerably higher in EMT than both microacinar and squamous samples (Fig. 5D).

Fig. 5figure 5

Heterogeneous activation of KRAS pathway in EMT histological subtype. (A) Proportion of each tumor histological subtype with activating mutations in KRAS in bar plot format. (B) Sequence variation in KRAS between all tumors sequenced as shown by a logo plot illustrates canonically activating mutations in KRAS. (C) A volcano plot of the ssGSEA values from the MSigDB C6 oncogenic signature gene set, showing EMT upregulated or downregulated gene sets compared to microacinar and squamous. (D) Violin plot of a representative KRAS pathway signature from the ssGSEA values of the C6 oncogenic signature gene set, showing distinct upregulation of KRAS signaling in the EMT subtype. (E) Heatmap of log2 fold change in copy number segmentation values showing high-level amplification of FGFR2 in EMT. (F) Canonical molecular pathway signaling reveals FGFR2 lies directly upstream of KRAS (Created with BioRender.com)

Despite its high ssGSEA KRAS activity scores (Additional file 27: Table S2), EMT tumor 1356 possessed no KRAS mutations or mutations in genes elsewhere in the RAS pathway. Copy number segmentation data obtained pointed to a high-level amplification event on chromosome 7, encompassing FGFR2 and ATE1 (Fig. 5E). Of particular significance is the estimated copy number gain of this region. EMT tumor 222 had a similarly bounded focal amplification event over FGFR2 and ATE1 that increased both gene copy numbers by 1, but EMT tumor 1356 has an estimated copy number gain of 471 over this region and correlates well with gene expression levels of FGFR2 (Additional files 26 and 28: Tables S1 and S3). The scale of this amplification event suggests a strong selective pressure for focal amplification of FGFR2 in EMT tumor 1356. When examining the pathways of FGFR2 in the literature, FGFR2 acts directly upstream of RAS-MAPK, PI3K-AKT, and JAK-STAT signaling pathways [65] (Fig. 5F). Thus, we propose that the increase in KRAS signaling seen in EMT tumor 1356 is due to the large, focal amplification of FGFR2 and subsequent activation of RAS signaling. Stemming from the previous assessment, it is likely that the EMT histological phenotype is dependent on increased RAS signaling, ostensibly through heterogeneous genetic mechanisms.

Squamous tumors represent an intermediate phenotype between microacinar and EMT

To this point, there are strong associations between copy number amplifications specific to the microacinar subtype and heterogeneous genetic events that lead to KRAS pathway activation in the EMT subtype. However, there are no readily identifiable conserved genetic features that can explain the squamous phenotype. Despite this, squamous tumors consistently occupy a gene expression state between that of the EMT and microacinar subtypes. PCA of C2 curated gene sets (Fig. 6A) and C6 oncogenic signature gene sets (Fig. 6B) show clustering of squamous samples between microacinar and EMT, with some squamous samples invading both the microacinar and EMT clusters.

Fig. 6figure 6

Squamous represents an intermediate phenotype between microacinar and EMT. (A) PCA of the MSigDB C2 curated and (B) C6 oncogenic signature gene sets for ssGSEA values of the microacinar, squamous, and EMT tumors recapitulates C1 clustering and explains more variance in the data. (C) Pairwise relationship plots for representative C2 gene set and (D) C6 gene set ssGSEA values are shown with linear regression lines and a 95% CI. Pearson R correlation values are shown with p-values determined from a two-sided t-test

From the previous results, we sought to examine the relationship between individual pathways in each gene set that could explain these differences. To accomplish this, we chose the top three differentially expressed representative pathways from EMT and microacinar pathway signatures for C2 (Fig. 6C) and C6 (Fig. 6D) MSigDB gene sets. While there were no statistically significant correlations within each histological subtype (data not shown), likely due to limited numbers of samples, pairwise correlations of pathway signatures across all three subtypes showed both significant positive and negative correlations between ssGSEA pathway activities. Beyond this, top differentially expressed pathway activities appear to lie on a continuum, with squamous tumors routinely spanning between and infiltrating the microacinar and EMT clusters. Pairwise relationships for the C1 positional gene sets shows similar patterns (Additional file 10: Fig. S6). These data are consistent with the squamous histology occupying an intermediate phenotype between that of EMT and microacinar subtypes.

Integrated mouse data stratifies human breast cancer subtypes and yields clinical insights

It is clear that the MMTV-Myc mouse model produces primary mammary tumors that are heterogeneous in histology, gene expression, metastatic variance [13], and now somatic genomic perturbations that can explain many of the transcriptional differences seen in this model. However, the significance of these events and translational potential to humans is not immediately obvious.

To evaluate whether the events seen in MMTV-Myc mouse model could have predictive power in clinical outcomes, we utilized an integrative approach, combining gene expression data and somatic genetic events to examine how they can parse human breast cancer subtypes. To this end, we took all genes that were differentially expressed between MMTV-Myc histological subtypes that were also present in conserved copy number gain or loss events to obtain the integrative gene set (Additional file 29: Table S4). Subsequently, we performed unsupervised hierarchical clustering on human gene expression from the METABRIC breast cancer dataset [66,67,68] limited to the integrative mouse gene set (Fig. 7A). We found distinct clusters emerged that well represented the intrinsic subtypes of breast cancer. Importantly, clustering from the same number of genes used but instead randomly selected failed to resolve intrinsic breast cancer subtype clusters to the same degree (Additional file 1120: Figs. S7–S16). This suggests that the integrated mouse gene set represents a diverse set of informative genes across all intrinsic subtypes of human breast cancer that can effectively differentiate between subtypes rather than contributing noise. Clustering by the PAM50 gene set showed improved subtype clustering overall relative to the mouse integrative gene set but did not resolve the luminal A and luminal B subtypes to the same extent (Fig. 7B). It is important to note that the PAM50 gene set is curated specifically to be able to differentiate between human intrinsic breast cancer subtypes, so seeing improved performance relative to the integrative mouse gene set is expected.

Fig. 7figure 7

Mouse model genetic and transcriptional events inform human clinical outcomes. (A) Unsupervised hierarchical clustering of METABRIC gene expression values by a list of 453 homologous genes that are in a conserved amplification/deletion event and are differentially expressed between MMTV-Myc histological subtypes. (B) Unsupervised hierarchical clustering of METABRIC gene expression values by the PAM50 gene set. (C) Overall survival (OS) Kaplan–Meier (KM) analysis of non-redundant TCGA breast cancer patients, accessed through cBioPortal, stratified by Myc amplification status. (D) OS KM curve of non-redundant TCGA breast cancer patients stratified by KRAS alteration status. (E) OS KM curve of non-redundant TCGA breast cancer patients stratified by FGFR2 amplification status

It is clear that genetic events in the MMTV-Myc mouse model and their resulting gene expression differences can resolve human breast cancer intrinsic subtypes. However, it is unclear to what extent these genetic events in the MMTV-Myc mouse model represent genetic events occurring in human breast cancer. To address this, we assayed publicly available TCGA datasets on breast cancer available through cBioPortal. We assessed prevalence of genetic events in human breast cancer first identified to be conserved both in genomic alteration status and differential gene expression in the MMTV-Myc mouse model. Subsequently, we examined the effects of these genes on human breast cancer overall survival clinical outcomes through Kaplan–Meier analysis.

All of the genetic events examined were found to be co-occurring with MYC amplification in human breast cancer (Additional file 30: Table S5), especially supporting the use of the MMTV-Myc mouse model in studying MYC-driven human breast cancers. However, because of the co-occurring nature of these events and the limited number of patient samples with genetic events and matched clinical outcomes, statistical significance is difficult to achieve in mutually exclusive populations of MYC amplification and identified genetic events. MYC amplification and overexpression are well described in TNBC and basal-like subtypes of breast cancer [1, 69], known as the deadliest subtypes of breast cancer currently [70, 71], which must be accounted for in survival analysis. To remedy this, we look at relative differences in overall survival from MYC amplification compared to analogous genetic events in humans identified in the mouse model.

Kaplan–Meier analysis on breast cancer patients revealed MYC amplification was present in 18.2% of patients surveyed, with median overall survival at 135.2 months (95% CI: 113.7–148.1) compared to the unaltered population with median overall survival at 171.3 months (95% CI: 161.2–182.9) (Fig. 7C, Additional files 31 and 32: Tables S6 and S7). When comparing the patient population of those harboring KRAS activating mutations or amplifications, which account for 2.2% of all breast cancer cases, we find a marked decrease in overall survival compared to the unaltered cohort (Fig. 7D) with median overall survival at 77.7-months (95% CI: 61.8–146.4) for the KRAS altered population compared to median overall survival of 164.3 months (95% CI: 154.3–173.0) for the unaltered cohort (Additional files 33 and 34: Tables S8 and S9). While it cannot be ruled out that MYC amplification co-occurrence in the KRAS altered cohort reduces overall survival, the KRAS altered population maintains a 57.5-month median overall survival deficit to MYC amplification alone. Thus, MYC amplification may only have a modest additive effect to KRAS amplification or activating mutations in this cohort. It is worth noting that the unaltered population in the KRAS cohort has reduced overall survival compared to the MYC unaltered cohort, suggesting most patients with MYC amplification are largely shunted to the unaltered group.

Similarly, patients with focal FGFR2 amplifications, accounting for 1.5% of all breast cancer cohort patients, also exhibit a marked decrease in overall survival compared to the unaltered cohort (Fig. 7E) with median overall survival at 87.7 months (95% CI: 62.4–191.0), contrasting with the unaltered cohort at 163.5 months (95% CI: 154.0–172.9) (Additional files 35 and 36: Tables S10 and S11). Again, patients with FGFR2 amplification fare significantly worse than those with MYC amplification, standing at a difference of 47.5-month median overall survival difference. While KRAS and FGFR2 alterations are infrequent in breast cancer, these alterations may be extremely significant in the prognosis and treatment of their disease.

Altogether, these analyses reveal conserved genetic events between both human Myc-driven breast cancer and the MMTV-Myc mouse model of breast cancer. Importantly, these somatic genetic events are associated with severe drops in overall survival in human breast cancer patients in large excess of what Myc amplification causes.

Machine learning classifier predicts MMTV-Myc histological subtypes correspond to different human breast cancer intrinsic subtypes

Thus far, we have shown that there exist tightly linked transcriptional and genomic perturbations in the MMTV-Myc model, which are heterogeneous across its histological subtypes with clinical implications in humans. However, whether these histological subtypes are representative of different human breast cancer intrinsic subtypes is unclear. Unsupervised clustering by the integrative mouse gene set shows modest ability to parse human intrinsic breast cancer subtypes, but it falls short of being able to discriminate between features (genes) that can best exemplify a given class (intrinsic subtype). To this end, we employed a machine learning model classifier trained on human METABRIC gene expression data to predict which human breast cancer intrinsic subtype each MMTV-Myc histological subtype best represents.

To accomplish this, we first combined the raw METABRIC microarray gene expression data with the MMTV-Myc microarray data (GSE15904), along with additional mouse microarray cohorts for variations of the MMTV-Neu mouse model (GSE42533) and the MMTV-PyMT mouse model (GSE104397) for comparison. PCA of normalized data revealed strong batch effects between datasets that would distort causal biological interpretation (Additional file 21: Fig. S17). In removing batch effects, we employed the parametric empirical Bayes shrinkage adjustment available from ComBat [72], effectively eliminating non-biological differences between datasets (Additional file 22: Fig. S18).

We then sought to narrow down the number of features in this dataset to avoid overfitting the model and make it more generalizable to new patient data for intrinsic subtype prediction. The prediction analysis of microarray 50 (PAM50) is a well-established scoring metric of gene expression data for 50 genes to stratify breast cancer patients into intrinsic subtypes and offer clinical prognoses [73]. From here, we employed recursive feature elimination with cross-validation (RFECV) with a support vector machine (SVM) radial basis function (RBF) kernel estimator to determine the optimal features in the dataset. Surprisingly, 32 specific genes (Additional file 37: Table S12) from the PAM50 subset gave higher cross-validation scores than utilizing all 50 genes (Additional file

留言 (0)

沒有登入
gif