We established four immortalized lymphoblastoid cell lines from a Chinese Quartet family comprising the father (F7), the mother (M8), and monozygotic twin daughters (D5 and D6) (Fig. 1a). The Quartet DNA reference materials are genomic DNA (gDNA) extracted from each immortalized lymphoblastoid cell line in large single batches. They have been certified by China's State Administration for Market Regulation as the First Class of National Reference Materials and are widely used for proficiency testing and method validation. The certified reference material numbers are GBW09900 (D5), GBW09901 (D6), GBW09902 (F7), and GBW09903 (M8).
Fig. 1 Study design with monozygotic twins and data generation. a Overview of the study design. Briefly, DNA reference materials were constructed from immortalized cell lines of a Chinese Quartet comprising the father (F7), the mother (M8), and monozygotic twin daughters (D5 and D6). They were sequenced on four short-read and three long-read platforms at seven labs. Small variant and structural variant benchmark calls were integrated from these extensive sequencing datasets. The performance of a test dataset can be evaluated by comparison with the benchmark calls or with the genetic built-in truth within the Quartet family. b Schematic overview of the short-read sequencing datasets. Three replicates of each Quartet DNA reference material were sequenced in nine batches, with both PCR and PCR-free libraries, on four sequencing platforms at six labs, resulting in 108 WGS libraries. c Schematic overview of the long-read sequencing datasets. One replicate of each Quartet DNA reference material was sequenced per batch on PacBio Sequel, PacBio Sequel II, and ONT. Eleven combinations of three mappers and five callers were used to detect structural variants, resulting in 120 variant calling datasets
To characterize germline small variants and SVs for benchmark calls without platform bias, we sequenced all four Quartet genomes on four short-read platforms (Illumina HiSeq and NovaSeq, MGI MGISEQ-2000 and DNBSEQ-T7; 30–60× coverage) and three long-read platforms (Oxford Nanopore Technologies (ONT), 100× coverage; Pacific Biosciences (PacBio) Sequel, 80× coverage; and PacBio Sequel II, 30× coverage) at seven centers. We then used four orthogonal technologies, including linked-read sequencing (10x Genomics, 30× coverage), an SNP array (the Axiom Precision Medicine Research Array (PMRA)), optical mapping (BioNano), and PacBio circular consensus sequencing (CCS) reads (50× coverage), to validate and refine the benchmark calls (Fig. 1a and Additional file 2: Table S1).
A total of 108 germline small variant call sets were obtained from the 27 short-read WGS libraries of each Quartet genome using the widely adopted GATK best practices workflow (BWA-MEM and HaplotypeCaller (HC)) (Fig. 1b and Additional file 3: Table S2). A total of 120 germline SV call sets were obtained from the three long-read WGS libraries of each Quartet genome using 11 combinations of three aligners (NGMLR [27], minimap2 [28], and pbmm2) and five callers (Sniffles [27], NanoSV [29], cuteSV [30], SVIM [31], and pbsv) (Fig. 1c, Additional file 4: Table S3, and Additional file 5: Table S4).
Variant call sets of the monozygotic twins are expected to be identical, because the twins inherited the same genome from their parents. When investigating the consistency of call sets across sequencing platforms, variant calling methods, and Quartet samples (Additional file 1: Fig. S1), we observed that SNVs, small indels (<50 bp), and large insertions and deletions (≥50 bp) clustered distinctly by the identity of the Quartet samples, with the monozygotic twins grouped together as expected. However, for large duplications, inversions, and translocations (≥50 bp), the call sets did not cluster by sample identity but instead clustered strongly by bioinformatic pipeline, indicating a lack of reliability or consistency of the bioinformatic pipelines for these three SV types. These three SV types were therefore not included in the benchmark sets.
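To make this kind of consistency analysis concrete, below is a minimal sketch (not the study's actual pipeline) that clusters call sets by pairwise Jaccard distance over their variant keys; the VCF file names and the choice of average-linkage hierarchical clustering are illustrative assumptions.

```python
import itertools
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def variant_keys(vcf_path):
    """Collect (chrom, pos, ref, alt) keys from a plain-text VCF."""
    keys = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _, ref, alt = line.split("\t")[:5]
            keys.add((chrom, pos, ref, alt))
    return keys

def jaccard_distance(a, b):
    """1 - |intersection| / |union|; 0 means identical call sets."""
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical inputs: one VCF per sample x platform x pipeline call set.
paths = {"D5_novaseq_hc": "D5_novaseq_hc.vcf",
         "D6_novaseq_hc": "D6_novaseq_hc.vcf"}
callsets = {name: variant_keys(p) for name, p in paths.items()}
names = sorted(callsets)

# Condensed pairwise distance vector in itertools.combinations order,
# which is the layout scipy's linkage() expects.
dist = np.array([jaccard_distance(callsets[a], callsets[b])
                 for a, b in itertools.combinations(names, 2)])
tree = linkage(dist, method="average")
labels = fcluster(tree, t=4, criterion="maxclust")  # e.g., 4 sample groups
```

With real inputs, labels that track the Quartet sample of origin (rather than the pipeline) indicate the behavior reported above for SNVs, indels, insertions, and deletions.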
Determining small variant benchmark calls and regions

To define germline variant benchmark calls, we first selected reproducible variants among the call sets for each Quartet sample. Because the number of Mendelian violations was much higher than the expected number of de novo mutations or somatic mutations arising from cell culture, all Mendelian violations were assumed to be errors [20]. Thus, we excluded Mendelian violations from the benchmark calls, even when they were reproducible among call sets.
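As a concrete illustration of the quartet-based Mendelian check (a simplified sketch, not the VBT tool used later in this study), the rule below treats a biallelic diploid site as consistent only when the twins' genotypes match and each twin allele can be drawn from one parent:

```python
def mendelian_consistent(father, mother, d5, d6):
    """Quartet Mendelian check on diploid genotypes.

    Genotypes are allele tuples, e.g. (0, 1) for a 0/1 call.
    The twins must agree, and the shared child genotype must be
    explainable by one allele from each parent.
    """
    if sorted(d5) != sorted(d6):  # monozygotic twins must match
        return False
    child = d5
    return any(
        sorted((pa, ma)) == sorted(child)
        for pa in father
        for ma in mother
    )

# Father 0/1, mother 0/0, twins both 0/1 -> consistent
assert mendelian_consistent((0, 1), (0, 0), (0, 1), (0, 1))
# Twins 1/1 with mother 0/0 -> violation (mother cannot give a 1 allele)
assert not mendelian_consistent((0, 1), (0, 0), (1, 1), (1, 1))
```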
We generated a small variant benchmark dataset by integrating the 108 call sets (27 call sets per sample) of all four Quartet samples based on short-read WGS. At the individual sample level, we started with a total of ~6 million variants across the 27 call sets and retained an average of ~4.6 million consensus variants after voting (see "Methods") across triplicates within a batch, sequencing labs, and library preparation methods (PCR-free and PCR) (Fig. 2). To check the Mendelian consistency of the remaining variants, genotypes must be confidently detected in all four Quartet samples for each variant. We therefore removed a total of 412,054 variant positions with no-call or conflicting genotypes among the 27 call sets of any Quartet sample. Compared with the irreproducible variants filtered during the voting process, these removed variants showed higher variant allele frequency (VAF), read depth, mapping quality, and genotype quality (Additional file 1: Fig. S2); therefore, they could not be removed simply by raising variant filtration thresholds.
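The voting step can be pictured with a minimal sketch like the one below (an illustration, not the exact "Methods" procedure; the genotype encoding and support threshold are assumptions): a genotype is kept for a sample only when enough of that sample's call sets agree on it, and positions with no-call or conflicting genotypes are dropped.

```python
from collections import Counter

def consensus_genotype(genotypes, min_support):
    """Majority vote over one sample's call sets at a single site.

    `genotypes` is a list like ["0/1", "0/1", "./.", ...] with one
    entry per call set; returns the winning genotype, or None when
    no genotype reaches `min_support` concordant calls.
    """
    called = [g for g in genotypes if g not in (None, "./.")]
    if not called:
        return None  # no-call position
    genotype, count = Counter(called).most_common(1)[0]
    return genotype if count >= min_support else None

# e.g., require agreement in a majority of the 27 call sets
gt = consensus_genotype(["0/1"] * 20 + ["./."] * 7, min_support=14)
assert gt == "0/1"
```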
Fig. 2 Integration workflow of Quartet small variant and structural variant benchmark calls. This workflow depicts the integration process used to obtain small variant benchmark calls from 108 original GVCF call sets. Numbers in the boxes represent the small variants remaining after each data processing step shown in the grey dotted boxes. Approximately 6 million small variants were discovered in the 27 call sets for each Quartet reference sample. About 1.5 million small variants were removed by the voting process ("Methods"). We merged the four consensus call sets corresponding to the four Quartet samples and discarded variants that did not reach agreement across the 27 replicates in any Quartet sample. Only Mendelian consistent variants, i.e., variants shared by the twins, following Mendelian inheritance laws, and validated by the PMRA and PacBio CCS datasets, were kept as small variant benchmark calls
We identified 5,708,723 small variant positions with reproducible genotype calls in all four Quartet samples. These remaining variants were further examined for Mendelian consistency in the Quartet family, and 7329 (0.13%) of them were identified as Mendelian violations. We manually inspected the 4761 such variants located in the callable regions (see "Methods") with high mapping quality. Of the 3221 validated small variants, 1034 overlapped with large deletions; they were mistakenly flagged as Mendelian discordant by the Mendelian analysis software VBT [32], which assumes that variants are always transmitted at diploid loci. By comparison with the variants detected in matched blood samples of the Quartet family members, we found 95 pre-twinning germline de novo variants shared by the twins (homozygous reference in the parents and heterozygous or homozygous alternative in the twins), one postzygotic germline de novo variant specific to Quartet-D5, 1532 somatic variants (also found in blood), and 556 variants probably accumulated during cell culture (not found in blood) (Additional file 6: Table S5). Finally, we retained the Mendelian violations confirmed by manual curation in the initial catalog of benchmark calls. This process resulted in about 4.2 million well-supported small variants for each Quartet sample.
Previous studies show that PacBio CCS reads yield higher variant calling accuracy than short-read NGS, especially for variants in the repetitive regions of the genome. When compared with the variants called from 50× coverage of PacBio CCS reads, 98.7% of SNVs and 95.0% of small indels in our benchmark dataset were validated (Additional file 2: Table S6). Of the unvalidated variants, 89.7% were located in the repetitive regions of the genome, especially segmental duplications (41.6%) and centromeric regions (27.9%).
We also validated the small variant benchmark dataset using 16 replicates of the PMRA SNP array. Of the 902,394 clinically related probes assayed on the PMRA array, we obtained 793,024 Mendelian consistent probes in the benchmark regions that were well supported by most replicates. Of those reliable probes, 99.99% of homozygous reference calls, 98.6% of SNVs, 95.7% of small insertions, and 96.2% of small deletions were concordant with the NGS consensus variants (Additional file 2: Table S7). Among the 2845 discordant variants, 2704 were detected by the PMRA array but were absent from NGS. We manually inspected the read alignments of the remaining 141 calls, which were either missed by NGS or genotyped differently from the PMRA array, and found that only seven were obvious false positives in the NGS consensus calls caused by misalignment of NGS reads. These seven obvious false positives were subsequently removed from the small variant benchmark calls. In total, the two validation processes removed 61,532 SNVs and 61,152 indels from the benchmark call sets.
To enable the identification of false positive and false negative variants, we defined benchmark regions for small variants (Additional file 1: Fig. S3). These benchmark regions were derived by integrating callable regions, i.e., regions where short reads can be accurately mapped to the human reference genome with high mapping quality. Within these regions, we excluded high-confidence large deletions and insertions integrated from long reads, together with their 50-bp flanking regions. The benchmark regions were then defined as the high-confidence variant regions and homozygous reference regions within the consensus callable regions shared by all Quartet samples. They cover approximately 87.8% of the GRCh38 reference genome (~2.66 Gb; chr1-22, X). Consensus and Mendelian consistent variants outside the benchmark regions were not included in the final benchmark call sets (Table 1).
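This region bookkeeping reduces to BED-style interval arithmetic. Here is a minimal sketch (the inputs and function name are illustrative assumptions) of subtracting SV intervals, padded by 50-bp flanks, from the callable regions of one chromosome:

```python
def subtract_intervals(callable_ivs, sv_ivs, flank=50):
    """Remove SV intervals (padded by `flank` bp) from callable intervals.

    Both inputs are sorted, non-overlapping lists of (start, end)
    half-open intervals on a single chromosome.
    """
    padded = [(max(0, s - flank), e + flank) for s, e in sv_ivs]
    result = []
    for start, end in callable_ivs:
        cur = start
        for ps, pe in padded:
            if pe <= cur or ps >= end:
                continue  # no overlap with this SV
            if ps > cur:
                result.append((cur, ps))  # keep segment before the SV
            cur = max(cur, pe)
        if cur < end:
            result.append((cur, end))
    return result

# Callable 0-1000 with a deletion at 400-500 leaves 0-350 and 550-1000
assert subtract_intervals([(0, 1000)], [(400, 500)]) == [(0, 350), (550, 1000)]
```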
Table 1 Summary of Quartet small variant and structural variant benchmark calls and regions

We further compared the small variant benchmark calls with high-confidence call sets from two accompanying studies [33, 34] (Additional file 1: Fig. S4). These two high-confidence call sets provide orthogonal confirmation of our calls (FDU): Pan et al. (NCTR) [33] integrated high-confidence calls from four mappers (Bowtie2, BWA, ISAAC, and Stampy) and eight callers (FreeBayes, GATK-HC, GATK-HC (Sentieon), RTG, ISAAC, Samtools, SNVer, and VarScan), and Jia et al. (XJTU) [34] constructed haplotype-resolved high-confidence calls by combining short-read and long-read technologies. Comparing variants in the intersection of the three studies' high-confidence regions, we found that 99.9% of SNVs and 99.2% of indels in our FDU call set could be confirmed by either the NCTR or the XJTU call set.
Determining structural variant benchmark calls and regions

A similar strategy was used to determine SV benchmark calls by integrating the 120 call sets obtained from the long-read WGS data (Fig. 3, "Methods"). For each individual in the Quartet family, the tool Jasmine was used to merge SV call sets from different sequencing platforms and SV calling pipelines based on variant breakpoints and lengths [35], leaving about 90,000 merged SVs per Quartet sample. SVs supported by the same pipeline on at least two sequencing platforms, or by at least six pipelines on the same platform, were retained as consensus SVs. SVs over 10 Mb and those located in centromeres, pericentromeric regions, and gap regions of the reference genome were excluded. The remaining 31,659 SVs were then re-genotyped in the pedigree using three genotypers (Sniffles [27], SVjedi [36], and LRcaller [37]) with PacBio Sequel and ONT reads. Genotypes supported by at least six of the ten genotype call sets (23,891) were taken as the consensus genotype calls for each Quartet sample. SVs with conflicting genotypes had higher VAFs (0.12–0.25 and 0.75–1.0) than variants discordant among replicates (0.12), but below the expected peaks near 0.5 (heterozygous) and 1.0 (homozygous), respectively (Additional file 1: Fig. S5).
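A minimal sketch of the consensus rule described above follows (the data structures and pipeline names are illustrative assumptions): each merged SV carries the platform and pipeline of every call set that detected it, and the SV is kept when one pipeline sees it on at least two platforms or at least six pipelines see it on one platform.

```python
from collections import defaultdict

def is_consensus_sv(supports, min_platforms=2, min_pipelines=6):
    """Apply the consensus rule to one merged SV.

    `supports` is a list of (platform, pipeline) pairs, one per
    call set in which the merged SV was detected.
    """
    platforms_per_pipeline = defaultdict(set)
    pipelines_per_platform = defaultdict(set)
    for platform, pipeline in supports:
        platforms_per_pipeline[pipeline].add(platform)
        pipelines_per_platform[platform].add(pipeline)
    same_pipeline_two_platforms = any(
        len(p) >= min_platforms for p in platforms_per_pipeline.values())
    many_pipelines_one_platform = any(
        len(p) >= min_pipelines for p in pipelines_per_platform.values())
    return same_pipeline_two_platforms or many_pipelines_one_platform

# Seen by NGMLR+cuteSV on both Sequel and ONT -> consensus
assert is_consensus_sv([("Sequel", "ngmlr-cutesv"), ("ONT", "ngmlr-cutesv")])
```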
Fig. 3 Integration workflow of structural variant benchmark calls. This workflow depicts the integration process used to obtain structural variant benchmark calls from 120 call sets. Numbers in the boxes represent the structural variants remaining after each data processing step shown in the grey dotted boxes. Briefly, approximately 90,000 structural variants were discovered in the 30 call sets of each Quartet reference sample. We first kept structural variants supported by at least two sequencing platforms or by at least six pipelines from one sequencing platform, then removed SVs longer than 10 Mb or located in centromeric or pericentromeric regions and gaps. INSs and DELs were extracted for the construction of the structural variant benchmark calls. Sniffles was used to report structural variant sequences, and structural variants whose sequences could not be reported were filtered out. Iris was applied to refine variant sequences. After obtaining consensus structural variants across multiple datasets, we merged the four catalogs of reproducible variants from each Quartet reference sample and obtained 31,659 SVs in total. Three genotypers were used to determine the genotypes of these SVs, and only SVs with consensus genotypes in at least six of the ten genotype call sets were kept for Mendelian analysis. The final structural variant benchmark calls were shared by the twins and followed Mendelian inheritance laws with respect to the parents
After obtaining consensus genotyped SVs, we removed Mendelian-violating SVs. Among the 194 Mendelian-violating SVs, we found two de novo heterozygous variants shared by the twins and four individual-specific heterozygous variants in one of the twins, which were probably somatic or arose from cell culture (Additional file 2: Table S8). Following manual curation, we determined that the remaining 188 SVs were incorrectly genotyped; most of them (91.7%) were located in simple repeats longer than 100 bp or in segmental duplications, or were clustered with other variants. Finally, ~15,000 benchmark SVs were retained in the benchmark call set for each Quartet sample (Table 1). Consistent with prior studies, we observed three peaks in the SV length distribution near 300 bp, 2.1 kb, and 6 kb, likely reflecting the activities of Alu elements, SVA elements, and full-length LINE-1 elements in the human genome (Additional file 1: Fig. S6).
Validation based on Illumina short reads, 10x Genomics linked reads, BioNano optical mapping, and whole-genome assemblies from PacBio CCS and PacBio CLR data showed that our SV benchmark call set is of high quality (Additional file 2: Table S9). The overall validation rates of insertions and deletions by at least one technology were 95.24% and 95.78%, respectively. Although we integrated a short-read SV validation call set from 15 SV callers, the validation rates by short reads (48.7% for INS and 76.0% for DEL) were much lower than those by long-read assemblies (90.7% for INS and 92.6% for DEL). BioNano validated only 3.2% of INSs and 1.8% of DELs over 1 kb, owing to its kilobase-scale resolution determined by specific restriction enzyme cut sites and its inability to determine breakpoints accurately [38]. We also validated our SV benchmark call set against Jia et al. [34] and found that 97.1% of INSs and 91.9% of DELs were confirmed.
We also compared our SV benchmark calls with the SVs identified by GRC [39], HGSVC [40], and HX1 [41] in different sets of samples. The validation rates were 91.3%, 77.8%, and 54.7%, respectively. The high validation rate against GRC reflects the inclusion of a Chinese sample in that study and the fact that its SVs were also detected from long-read data. Note that such comparisons based on a limited number of samples can only detect the common SVs shared among samples.
To define SV benchmark regions, we used ~100× PacBio Sequel CLR reads to establish haploid de novo assemblies for the parents F7 and M8 (2.94–2.99 Gb) and diploid de novo assemblies for the twins D5 and D6 (2.87–2.88 Gb). We then mapped the de novo assemblies to the GRCh38 reference genome and retained ~2.74–2.78 Gb of callable regions supported by reads longer than 50 kb with mapping quality greater than 5. Regions of assembly-specific SVs, centromeres, and gaps were excluded from the callable regions (Additional file 1: Fig. S7). The Quartet SV benchmark regions cover ~2.62 Gb of the reference genome (GRCh38; chr1-22) and contain ~12,705 (75.7–83.6%) of the benchmark SV calls. Only SVs inside the benchmark regions are considered when the variant calling performance of test sets is evaluated against the benchmark sets using precision and recall.
Applications of the Chinese Quartet genomic reference materials

Evaluating variant calling performance by pedigree information and benchmark sets

We used whole-genome variant call sets derived from various library preparation methods, sequencing platforms, and bioinformatic tools to demonstrate the usability of the Quartet DNA reference materials for evaluating variant calling performance. Each call set was evaluated by its F1 score within the benchmark regions and its Mendelian consistent rate (MCR) on the whole genome.
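For reference, the two metrics can be computed as below (a minimal sketch; the MCR definition here, Mendelian consistent sites over all quartet variant sites, is our reading of the text rather than a formula given in it):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall against benchmark calls."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mendelian_consistent_rate(n_consistent, n_total):
    """Fraction of quartet variant sites that follow Mendelian
    inheritance (and are shared by the twins), over the whole genome."""
    return n_consistent / n_total

# e.g., 4.0 M true positives, 40 k false positives, 60 k false negatives
print(round(f1_score(4_000_000, 40_000, 60_000), 4))  # ~0.9876
```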
Four mappers (Bowtie2, BWA, ISAAC, and Stampy) and eight germline variant callers (HaplotypeCaller (GATK and Sentieon versions), RTG, ISAAC, VarScan, FreeBayes, Samtools, and SNVer) were compared on ~30× Illumina short-read replicates from three sequencing centers (detailed information can be found in our companion study [33]) (Fig. 4a). Callers had a greater impact on variant calling accuracy than mappers. SNV calling performance was high and similar across callers (F1 score 0.978±0.012, MCR 0.944±0.017), whereas indel calling performance was lower and more variable (F1 score 0.732±0.158, MCR 0.695±0.094). RTG, Sentieon, and HaplotypeCaller showed the highest F1 scores for indel calling, while Samtools and SNVer performed the worst.
Fig. 4 Evaluating variant calling performance by pedigree information and benchmark sets. F1 score and MCR of different a mappers and callers for detecting small variants using Illumina short reads; b sequencing platforms and library preparation methods for detecting small variants; c callers for detecting SVs using Illumina short reads; and d sequencing platforms and combinations of mappers and callers for detecting SVs using long-read data
To investigate the small variant calling performance of different sequencing platforms, we called small variants with a single pipeline per data type: Sentieon for short-read data and DeepVariant for PacBio CCS reads (Fig. 4b). The Illumina platforms, MGI platforms, and PacBio CCS performed similarly, with no obvious differences. Sequencing platforms had a smaller impact on variant calling accuracy than library preparation methods. PCR-free libraries were superior to PCR libraries for detecting indels, with higher F1 scores (0.983±0.005 vs 0.958±0.016) and MCRs (0.921±0.050 vs 0.873±0.094).
To investigate SV calling performance, we compared 15 common callers using short-read data (Fig. 4c). Calling performance varied widely, with F1 scores ranging from 0 to 0.891 and MCRs ranging from 0 to 0.645. Detection of DELs by short reads was slightly more accurate than detection of INSs. Only Manta exhibited a relatively high F1 score and MCR for both INSs and DELs. The MCRs of INSs called by DELLY, GRIDSS, and MELT were much higher than their F1 scores evaluated against the benchmark calls, because these callers detected fewer variants and had lower recall. Most callers achieved high performance for DELs, except CNVnator, inGAP, and SvABA. inGAP identified many more DELs (60,151) than the benchmark calls but had both low precision and low recall, indicating low accuracy.
We also investigated the SV calling performance of long-read sequencing platforms and bioinformatic pipelines by retrospectively evaluating the SV call sets used in this study to establish the benchmark sets (Fig. 4d). Generally, more SVs were detected from long reads (7726±3203) than from short reads (4922±9604), and current sequencing technologies and algorithms perform better for DEL detection than for INS detection. The combination of mapper and caller should be chosen carefully for each sequencing platform, since different combinations had F1 scores ranging from 0.374 to 0.856 and MCRs ranging from 0.119 to 0.437. NGMLR with cuteSV performed well for both DELs and INSs on all three long-read sequencing platforms. pbmm2 with pbsv, which was developed specifically for the PacBio platforms, performed better on PacBio Sequel II than on Sequel. Notably, DELs detected by pbmm2/Sniffles had a low F1 score but a high MCR: compared with the median het/hom ratio of 2.2:1 across the 30 call sets, the het/hom ratio of pbmm2/Sniffles was 0.02:1, such that ~98% of SVs in all four individuals were genotyped 1/1, indicating that this pipeline's genotypes were unreliable.
We found that, on average, 9% of SNVs, 40% of indels, 33% of DELs, and 20% of INSs were located outside the benchmark regions and thus could not be evaluated against the benchmark sets. F1 scores for variants inside the benchmark regions may not reflect accuracy outside them (Additional file 1: Fig. S8a); as expected, error rates were significantly higher outside the benchmark regions. Moreover, the Quartet family design identified more false positive variant candidates than twin or trio designs and enabled a more precise measurement of error rates (Additional file 1: Fig. S8b).
Identifying and mitigating batch effects in genomic sequencing

To identify batch effects in WGS using the Quartet DNA reference materials, we performed principal component analysis (PCA) on genotype calls from the various short-read sequencing platforms. Compared with RNA sequencing, DNA sequencing exhibited much smaller batch effects [22]. In the scatterplot of the first two eigenvectors, the monozygotic twin daughters clustered together, located midway between the two parents on PC1 and above the parents on PC2, as expected (Additional file 1: Fig. S9). We observed a clear batch effect in the third and fourth eigenvectors (Fig. 5a–d), with the sequencing platform playing an important role in these detectable batch effects. Large insertions exhibited the lowest reproducibility across sequencing platforms of all variant types, as obvious batch effects were observed even in the first two eigenvectors. Variants called outside the benchmark regions showed larger batch effects than variants called inside, as expected, because more variants outside the benchmark regions could not reach agreement among call sets (Additional file 1: Fig. S10).
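As an illustration (a minimal sketch with random placeholder data, not the study's exact analysis), PCA can be run on a libraries-by-variants genotype dosage matrix, encoding each genotype as its alternate-allele count:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical genotype dosage matrix: rows = libraries (sample x batch),
# columns = variant sites; entries are alt-allele counts 0/1/2, with
# missing genotypes imputed to the site mean beforehand.
rng = np.random.default_rng(0)
dosages = rng.integers(0, 3, size=(108, 5000)).astype(float)

pca = PCA(n_components=4)
pcs = pca.fit_transform(dosages)  # PCA centers the columns internally

# PC1-PC2 should separate the Quartet samples (twins overlapping);
# PC3-PC4 are where platform/lab batch effects showed up in this study.
print(pcs.shape)                      # (108, 4)
print(pca.explained_variance_ratio_)
```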
Fig. 5Quartet DNA reference materials can be used to identify and mitigate batch effects in DNA sequencing. The scatterplots of the third and the fourth eigenvectors generated from PCA show batch effects in a SNVs, b small indels, c large deletions, and d large insertions. e Reproducibility of variants called on the whole-genome region before and after filtration. f Precision of variants called inside the benchmark regions before and after filtration. g Recall of variants called inside the benchmark regions before and after filtration
Batch effects can be mitigated by filtering out batch-specific false positive variants based on variant quality metrics such as quality scores, read depth, and mapping quality. Pedigree information from the Quartet DNA reference materials can be used to select appropriate thresholds on these metrics for each batch to filter potential artifacts. We trained a one-class support vector machine (SVM) classifier on the variant quality metrics of Mendelian consistent variants (reliable variants) from one of the three replicates of each batch (Additional file 2: Table S10, batches 5, 6, and 7). The trained models were then applied to the other two replicates to filter potential false positives in each batch. The efficiency of this batch-specific filtration was assessed by precision, recall, and cross-batch reproducibility (Fig. 5e–g). After filtration, cross-batch reproducibility improved greatly. Precision against the benchmark calls increased while recall decreased, indicating that false positives were greatly reduced at the inevitable cost of a small number of true variants.
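A minimal sketch of this filtering idea follows; the file names, feature set, and SVM parameters are illustrative assumptions, not the study's exact settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Features per variant: e.g., QUAL, read depth, mapping quality, genotype
# quality. Train only on Mendelian consistent variants from one replicate.
train = np.loadtxt("batch5_rep1_mendelian_consistent.tsv")  # hypothetical file
test = np.loadtxt("batch5_rep2_all_variants.tsv")           # hypothetical file

scaler = StandardScaler().fit(train)
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
model.fit(scaler.transform(train))

# +1 = looks like a reliable variant, -1 = potential batch-specific artifact
keep = model.predict(scaler.transform(test)) == 1
print(f"retained {keep.sum()} of {len(test)} variants")
```

The one-class formulation fits the setting: only positive (reliable) examples are available per batch, and anything falling outside their quality-metric distribution is flagged as a likely artifact.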
Evaluating variants called from mRNA and protein

Apart from the DNA reference materials, we also established RNA, protein, and metabolite reference materials from the same large batch of B-lymphoblastoid cell lines. Multiomic reference materials derived from the same Quartet cell lines make it possible to cross-validate biological findings from one omics dataset against other omics levels, supporting quality assessment of a wide range of new technologies and bioinformatics algorithms.
We illustrated a cross-omics validation of variants detected using the Quartet genomics, transcriptomics, and proteomics datasets. As shown in Fig. 6a, an average of 15,580 RNA variants and 18 missense single-amino-acid variants were detected in the RNA-seq and LC-MS/MS-based proteomics data of Quartet D5, respectively. We compared the variants called by GATK HaplotypeCaller and DeepVariant [42] in the intersection of the benchmark regions developed in this study, CDS regions, and regions with at least 3× coverage. On average, GATK detected 16,305 SNVs and 1338 indels, while DeepVariant detected 13,181 SNVs and 334 indels in these intersected regions. DeepVariant called fewer variants but yielded higher precision and recall than GATK for both SNVs and indels (Fig. 6b). About 8.3% of RNA variants from DeepVariant and 25.9% from GATK could not be validated by the DNA small variant benchmark calls. When comparing the false positives against known RNA editing sites in REDIportal [43], we observed that DeepVariant largely disregarded RNA editing events (17 of 1100 false positives), whereas GATK detected more of them (562 of 4234) (Fig. 6c). This indicates that RNA editing does not contribute substantially to the high level of inconsistency observed between variants identified from DNA and RNA sequencing datasets; the discrepancy is primarily attributable to technical artifacts. Figure 6d shows that a specific SNV benchmark call can be validated by both RNA and protein data: a missense SNV (chr17:74,866,471 T>C) caused a single-amino-acid change from glutamic acid to arginine.
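A minimal sketch of such a cross-omics check follows (the file names and tab-separated key format are illustrative assumptions): RNA variants are labeled true or false positive by membership in the DNA benchmark call set, and false positives are cross-referenced against a table of known RNA editing sites.

```python
def load_sites(path, with_alleles=True):
    """Read a simple tab-separated variant file into a set of keys."""
    sites = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            # (chrom, pos, ref, alt) or just (chrom, pos)
            key = tuple(fields[:4]) if with_alleles else tuple(fields[:2])
            sites.add(key)
    return sites

benchmark = load_sites("quartet_d5_benchmark.tsv")   # hypothetical path
rna_calls = load_sites("d5_rna_deepvariant.tsv")     # hypothetical path
editing = load_sites("rediportal_sites.tsv", with_alleles=False)

true_pos = rna_calls & benchmark
false_pos = rna_calls - benchmark
fp_at_editing_sites = {v for v in false_pos if v[:2] in editing}
print(f"TP={len(true_pos)} FP={len(false_pos)} "
      f"FP at editing sites={len(fp_at_editing_sites)}")
```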
Fig. 6 Evaluating variant calling accuracy from RNA and protein data using benchmark variants constructed from DNA data. a Schematic of the central dogma and the number of variants detected in the Quartet DNA-seq, RNA-seq, and LC-MS/MS-based proteomics datasets. b Validation of Quartet RNA variants using the DNA reference datasets. True positives (TP) are RNA variants validated in the DNA reference datasets, whereas false positives (FP) are RNA variants not included in the DNA reference datasets. c Composition of RNA variant types in false positive (RNA_FP) and true positive (RNA_TP) RNA variant calls. d A T-to-C variant (located at chr17:74,866,471) detected by both DNA-seq and RNA-seq, visualized in IGV. The corresponding Glu-to-Arg variant was also detected by LC-MS/MS-based proteomics
These preliminary cross-omics validation results indicate that variant detection from RNA sequencing and LC-MS/MS-based proteomics remains challenging. The Quartet multiomics reference materials and datasets enable objective quality assessment of such emerging bioinformatics algorithms through cross-omics validation.