We have designed and implemented HERO ([H]ybrid [E]rror co[R]recti[O]n), a novel approach that makes combined use of DBGs and MAs/OGs for the purposes of HEC of TGS reads. HERO is "hybrid-hybrid" insofar as it uses both NGS and TGS reads on the one hand (hybrid in terms of combining reads with complementary properties) and both DBGs and MAs/OGs on the other hand (hybrid in terms of combining complementary data structures).
The foundation of HERO is the idea that aligning the short NGS reads with the long TGS reads prior to correction yields corrupted alignments because of the abundantly occurring indel artifacts in the TGS reads. This has exposed approaches that are exclusively (multiple) alignment (or even OG) based as inferior, which has blocked further progress in terms of the usage of (multiple) alignments/OGs in HEC. Closer analysis of earlier related work (most importantly LoRDEC [13], FMLRC [14], and Ratatosk [15]) makes it evident that DBG-based approaches have the power to remove sufficiently many mistaken indels from the TGS reads to render NGS-TGS-read alignments sufficiently meaningful.
This motivates the strategy that we have implemented. First, we use DBGs for removing artificial indels from the TGS reads to a degree that no longer disturbs subsequent alignments. Second, we use overlap-derived alignments for identifying mutations that characterize near-identical haplotypes. We recall that mistaking haplotype-specific mutations for errors is a primary source of issues when making use of DBGs alone.
Examining the prior approaches, this sequential protocol (first DBGs to make alignments work; second, now equipped with reliable alignments, MAs/OGs to take care of haplotype-specific, linked mutations) appears to be imperative. HERO implements this basically sequential strategy.
Note that this protocol can be extended to iterative schemes, where single elements of this protocol are repeatedly applied. To refine and optimize the correction, HERO also makes use of such iterative strategies. In this, the predominant principle is repeated application of DBGs followed by repeated application of MAs/OGs; see the “Workflow” section and the “Methods” section for further details.
As for the implementation of the different parts of the protocol, we found that for DBG-based pre-correction, employing known, heavily researched, and thus well-optimized methodology led to optimal results. In particular, we build on LoRDEC [13], FMLRC [14], and Ratatosk [15] for implementing the DBG-related parts of the workflow of HERO. To the best of our understanding, the protocols and workflow that we present are novel concepts.
Following DBG-based pre-correction, the MA/OG-based part that we present is methodologically novel in its own right. We recall that research on MA (or even OG)-based approaches for HEC has remained in its infancy. This motivated our substantial further developments.
We have drawn our inspiration for the MA/OG-based part predominantly from haplotype-aware assembly approaches; see [9, 21, 37] for examples. The basic principle that underlies these strategies is to identify SNPs in the overlaps of the TGS reads. With the disturbing indel artifacts removed from the TGS-NGS-read alignments thanks to the DBG part, identification of haplotype-specific SNPs is now possible.
After the identification of SNPs, HERO reconsiders the alignments of the NGS reads with the TGS reads themselves. It is now possible to phase the NGS reads by discarding mistaken and maintaining correct TGS-NGS-read alignments. The removal of mistaken TGS-NGS-read alignments leads us to a superior level of correction, namely the level of correcting TGS reads relative to the phases (haplotypes, strains, etc.) they stem from. Enabling this level of correction amounts to putting an improved correction strategy into effect.
The advantages of our novel strategy become most evident in scenarios relating to DNA samples characterized by multiple haplotypes/strains/lineages. We will therefore focus on metagenomics as one of the envisioned primary domains of application of HERO, on the one hand, and on diploid/polyploid plant genomes, on the other hand. Note in particular that HEC methods so far presented predominantly struggle with metagenome datasets or sequencing scenarios that are characterized by the issues inherent to polyploid or mixed samples.
From this point of view, we present the first HEC method applicable for metagenome sequencing data in particular and an HEC method that also caters to polyploid scenarios in a particularly favorable way. In brief (see the "Results" section for details), HERO reduces indel and mismatch error content by on average 65% (indels) and 20% (mismatches) in the metagenome datasets in comparison with prior state-of-the-art approaches, when correcting errors in PacBio CLR and ONT reads (as the classes of TGS that yield the longest reads). These advantages are certainly substantial; for indels, as the major class of errors affecting TGS reads, they are arguably even drastic. Last but not least, we will demonstrate that when using HERO prior to (strain-aware) metagenome assembly or haplotype-aware assembly of polyploid genomes, results improve substantially as well: one obtains considerable advantages in terms of the accuracy and completeness of the draft genomes, the two most important categories, in particular. This underscores the practical necessity to invest in refined, optimized error correction strategies.
In the following, in the "Workflow" section, we present the workflow of HERO. Subsequently, in the "Data and experimental setup" section, we discuss the data and the design of the experiments. In the subsequent subsections, we present the results that HERO achieved in comparison with the state-of-the-art HEC approaches.
Workflow
In this section, we provide a high-level description of the workflow. See the "Methods" section for detailed descriptions of all methodological steps involved.
See Fig. 1 for an illustration of the overall workflow of HERO. As outlined above, HERO consists of two stages.
Fig. 1The workflow of HERO. Black: long read r to be corrected. Yellow, blue, red: short reads, different colors indicate different haplotypes. Red crosses indicate errors. Tics (here: blue, red) indicate haplotype-specific variants
Fig. 3 Experiments to determine appropriate numbers of iterations for both DBG-based pre-correction of long reads, as performed by FMLRC, LoRDEC, and Ratatosk, and subsequent OG-based correction by HERO-OG, as per steps 1 to 6 in Fig. 1. The x-axis indicates the different iterations. The y-axis indicates the resulting error rates. Experiments were performed on a dataset containing simulated PacBio CLR reads from 3 Salmonella strains. In a and c, only FMLRC, LoRDEC, and Ratatosk were used. In b and d, 3 iterations of FMLRC, LoRDEC, and Ratatosk are followed by 5 iterations of HERO-OG. Results point out that for DBG-only approaches, 8 iterations are optimal, while for hybrid (DBG and OG) approaches, 3 iterations of DBG followed by 5 iterations of OG yield optimal error rates
The first stage consists of employing well-established, high-quality DBG-based HEC tools such as Ratatosk [15], FMLRC [14], and LoRDEC [13]. Attempts of ours to achieve further progress in terms of the DBGs involved did not lead to further improvements. Therefore, we also refrain from providing further illustration but refer the reader to the publications just mentioned for details.
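To convey the principle these DBG-based tools share, the following toy sketch (our illustration, not code from LoRDEC, FMLRC, or Ratatosk; the k-mer size and solidity threshold are arbitrary choices) counts k-mers in the accurate NGS reads and flags stretches of a TGS read that lack support by "solid" k-mers; the actual tools then replace such stretches by searching paths of solid k-mers in the graph.

```python
from collections import Counter

# Toy illustration of the DBG principle underlying LoRDEC/FMLRC/Ratatosk:
# k-mers that are "solid" (frequent) in the accurate NGS reads span the true
# sequence; stretches of a TGS read covered only by weak k-mers indicate
# errors to be corrected by searching solid-k-mer paths in the graph.
K, SOLID = 21, 3  # illustrative parameter choices, not the tools' defaults

def solid_kmers(ngs_reads):
    counts = Counter()
    for read in ngs_reads:
        for i in range(len(read) - K + 1):
            counts[read[i:i + K]] += 1
    return {kmer for kmer, n in counts.items() if n >= SOLID}

def weak_positions(tgs_read, solid):
    """Positions of the TGS read whose k-mers lack NGS support."""
    return [i for i in range(len(tgs_read) - K + 1)
            if tgs_read[i:i + K] not in solid]
```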
The second stage is the (multiple) alignment (and also overlap graph)-based correction, which is methodologically novel and consists of six major steps. These six major steps are illustrated in more detail in Fig. 1. The numbers in Fig. 1 refer to the steps as explained here.
Step 1 consists of computing minimizer-based all-vs-all overlaps between the NGS reads on the one hand and the TGS reads on the other hand. For this, we make use of Minimap2 [19], the currently leading tool for that purpose.
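As an illustration of how such overlaps can be computed in practice, the following minimal sketch uses mappy, the Python binding of Minimap2; the file names are placeholders and the preset choice is our assumption, so HERO's actual invocation may differ.

```python
# Sketch: overlap computation with minimap2 via its Python binding "mappy".
import mappy as mp

# Index the TGS (long) reads; "sr" is minimap2's short-read preset,
# used here to align NGS reads against the indexed long reads.
aligner = mp.Aligner("tgs_reads.fasta", preset="sr")  # placeholder path
if not aligner:
    raise RuntimeError("failed to build minimap2 index")

overlaps = []  # (ngs_read_name, tgs_read_name, tgs_start, tgs_end, cigar)
for name, seq, _qual in mp.fastx_read("ngs_reads.fastq"):  # placeholder path
    for hit in aligner.map(seq):
        overlaps.append((name, hit.ctg, hit.r_st, hit.r_en, hit.cigar_str))
```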
Steps 2–5 reflect the technical core of our read phasing and MA/OG-based error correction procedure. See Fig. 1 for an illustration.
In step 2, HERO picks one TGS read as the target read and computes an NGS read alignment pile that consists of all short reads that overlap the target read. The NGS read alignment pile may include substantial amounts of NGS reads that originate from haplotypes that differ from the one the target read was drawn from. Keeping such NGS reads in the pile would lead to "overcorrection" of the true mutations contained in them. To appropriately deal with the problem, we phase the NGS reads and filter out those that stem from other strains, based on the SNP information that one obtains from the overlap computations. We thereby identify NGS reads that originate from phases the target read does not stem from and remove them from further consideration; see Fig. 2 for a detailed illustration of this filtering procedure. In Fig. 1, the blue and the red NGS reads are filtered out because characteristic SNPs (blue and red tics) could be identified. Mismatches affecting the yellow NGS reads (stars) all lack coverage support, so yellow reads stay in the alignment pile.
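A minimal sketch of the coverage-support reasoning behind this filtering follows (our simplification of step 2; the threshold is a hypothetical value, not HERO's actual parameter): short reads that share a well-supported disagreement with the target read are assigned to another phase and removed, while isolated disagreements are treated as sequencing errors.

```python
from collections import Counter

# Toy sketch of step 2: phase short reads against the target long read at
# putative SNP positions. The threshold is illustrative only.
MIN_ALT_SUPPORT = 3  # fewer disagreeing reads than this -> sequencing error

def filter_short_reads(pileup, target_base_at, read_ids):
    """pileup: {pos: [(read_id, base), ...]} from the overlap alignments;
    target_base_at: {pos: base} of the long read; read_ids: set of ids.
    Returns the ids of the short reads kept in the alignment pile."""
    discarded = set()
    for pos, column in pileup.items():
        counts = Counter(base for _rid, base in column)
        target = target_base_at[pos]
        for alt, n in counts.items():
            if alt != target and n >= MIN_ALT_SUPPORT:
                # well-supported disagreement: likely a haplotype-specific
                # SNP, so reads carrying it belong to another phase
                discarded |= {rid for rid, base in column if base == alt}
    return read_ids - discarded
```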
In step 3, the now cleaned and "phase-unique" set of short reads is divided into small segments, the coordinates of which are given by the overlaps of the short reads with the target long read, which acts as a virtual coordinate system. Each of the segments gives rise to a window-like pile of NGS read segments that overlap the target long read; the segment of the target read itself in a particular window is referred to as the "target subread" in the following.
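The following sketch illustrates the window segmentation (the window size is an arbitrary illustrative value; for brevity, the cutting assumes ungapped alignment coordinates, whereas real alignments would be cut along their CIGAR strings).

```python
# Sketch of step 3: cut the target long read's coordinate system into
# windows and collect, per window, the pieces of the phase-filtered short
# reads that align there.
WINDOW = 500  # bp; hypothetical value

def window_piles(target_len, alignments):
    """alignments: [(read_id, r_st, r_en, aligned_seq)] with coordinates on
    the target read. Returns {window_start: [short-read segment, ...]}."""
    piles = {w: [] for w in range(0, target_len, WINDOW)}
    for _rid, r_st, r_en, seq in alignments:
        first = (r_st // WINDOW) * WINDOW
        for w in range(first, r_en, WINDOW):
            lo, hi = max(r_st, w), min(r_en, w + WINDOW)
            # naive ungapped cut, for illustration only
            piles[w].append(seq[lo - r_st:hi - r_st])
    return piles
```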
Subsequently, in step 4, we virtually discard the target read for the time being. For each window, we then compute a partial order alignment (POA) [38] of the NGS read segments that pertain to the particular window. Note that the segmentation into window-based target subreads is crucial, because a POA, which is a particular, graph-based type of multiple alignment, tends to accumulate artifacts if not confined to small genomic regions, as is usual for multiple alignments that span overly long sequences. The POAs of the short read segments then are the foundation for a (phase-unique!) consensus of the short read segments that make up the POA. Note that computing such a consensus virtually amounts to constructing an OG from the POA graph (which is an immediate procedure, as the POA graph spells out all necessary overlaps) and traversing that OG to assemble the short read segments into the desired consensus sequence. These consensus segments establish the corrected sequence of the target subread; therefore, one replaces the original target subread with the consensus segment to obtain an error-corrected version of the target subread.
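A sketch of the per-window consensus computation follows, assuming the pyspoa package (a Python binding of the spoa library) as one possible POA implementation; HERO's internal POA machinery may differ.

```python
# Sketch of step 4: per-window POA consensus over the phase-unique short
# read segments, using pyspoa as an example POA library.
from spoa import poa

def correct_window(short_read_segments):
    """Compute a phase-unique consensus for one window; the consensus
    replaces the original target subread."""
    if not short_read_segments:
        return None  # no coverage: leave the target subread uncorrected
    consensus, _msa = poa(short_read_segments)
    return consensus
```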
Finally, in step 5, the corrected “target subreads” are re-extended into the complete, full-length reads from which they were computed. Because the target subreads have now been successfully corrected, the full-length read is corrected as well.
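Re-extension then amounts to concatenating the corrected target subreads in window order, falling back to the original subread where a window had no usable consensus; a minimal sketch:

```python
# Sketch of step 5: stitch the corrected target subreads back together into
# one full-length corrected read.
def stitch(corrected_subreads, original_subreads):
    return "".join(
        cons if cons is not None else orig  # keep original where uncorrected
        for cons, orig in zip(corrected_subreads, original_subreads)
    )
```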
Step 6 then consists of collecting all corrected target reads. In our experiments, we found that iterating the MA/OG stage three times led to optimal results. This explains why we refer to HERO as the protocol that runs the MA/OG-based sequence of steps 1–6 three times. The resulting corrected TGS reads are the output of HERO.
Note, finally, that one can iterate this process in an overarching manner and reconsider the resulting corrected reads as input to the correction procedure (which consists of 3 MA iterations in itself) another time. In doing so, one can vary whether one makes repeated use of the DBG stage, the MA stage, or both of them. We make use of such ideas in the following as well. As a general insight, we found that repeated application of the DBG stage (for refined, optimized removal of indel artifacts) followed by repeated application of the MA stage (for refined resolution of phases) had the potential to yield further improvements.
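A hypothetical driver loop for this overarching iteration scheme could look as follows; dbg_correct() and ma_og_correct() are placeholder names standing in for one run of a DBG-based tool (e.g., LoRDEC) and one run of HERO's MA/OG stage, respectively, and the default iteration counts reflect the 3 DBG + 5 MA/OG scheme determined in the "Results" section.

```python
def dbg_correct(tgs_reads, ngs_reads):
    # placeholder: one run of a DBG-based corrector such as LoRDEC
    return tgs_reads

def ma_og_correct(tgs_reads, ngs_reads):
    # placeholder: one run of HERO's MA/OG stage (steps 1-6 of Fig. 1)
    return tgs_reads

def hero_protocol(tgs_reads, ngs_reads, dbg_iters=3, ma_og_iters=5):
    for _ in range(dbg_iters):        # stage 1: DBG-based indel removal
        tgs_reads = dbg_correct(tgs_reads, ngs_reads)
    for _ in range(ma_og_iters):      # stage 2: phase-aware MA/OG correction
        tgs_reads = ma_og_correct(tgs_reads, ngs_reads)
    return tgs_reads
```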
Data and experimental setup
The datasets that we deal with can be classified into three parts.
The first part refers to simulated metagenomic data, which includes 6 spiked-in datasets. As described in full detail in the “Methods” section, depending on numbers of strains included, we distinguish between simulated datasets of “3 Salmonella strains,” “20 bacterial strains,” and “100 bacterial strains.” In addition, we consider the just mentioned “strain-mixing spike-in” datasets, which result from spiking real data with simulated reads generated from known Salmonella strains.
The second part refers to the real metagenomic datasets. To evaluate the performance of HERO and prior state-of-the-art approaches in real data scenarios, we considered the two datasets Bmock12 and NWCs (for details, see the “Methods” section), because they have available reference genomes (so available ground truth), and all of Illumina, PacBio CLR, and ONT reads ready for download.
The third part refers to real diploid/polyploid plant genome datasets. In detail, we considered the diploids Arabidopsis thaliana, Arctia plantaginis, Oryza sativa japonica, and Oryza sativa aromatic, as well as the tetraploid mango (Mangifera indica).
Note that the lack of applicable haplotype-resolved reference genomes for the plant genomes requires evaluating the experiments in a different way: instead of MetaQUAST, we apply Merqury [39] for evaluating results, an established tool for such scenarios.
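For orientation, Merqury's reference-free consensus quality estimate can be sketched as follows (our rendering of the QV formula from the Merqury paper: k-mers of the evaluated sequence that are absent from the accurate read set are counted as errors; the example numbers are invented).

```python
import math

# Merqury-style consensus quality (QV): the per-base accuracy P is derived
# from the fraction of k-mers supported by the accurate (Illumina) reads,
# and QV = -10 * log10(1 - P). Requires error_kmers > 0.
def merqury_qv(k, total_kmers, error_kmers):
    p_base_correct = ((total_kmers - error_kmers) / total_kmers) ** (1.0 / k)
    error_rate = 1.0 - p_base_correct
    return -10.0 * math.log10(error_rate)

# Invented example: k=21, 1,000,000 k-mers of which 50 are unsupported
print(round(merqury_qv(21, 1_000_000, 50), 1))  # ~QV 56
```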
We refer the reader to the “Methods” section for technical details on the datasets.
Experiments: simulated metagenomic data
In the following, headings of paragraphs refer to different datasets. For detailed descriptions of the datasets, see the "Methods" section. In general, experiments are discussed in the order in which the datasets are explained in the "Methods" section.
3 Salmonella strains
This dataset consists of simulated reads generated from 3 Salmonella strains; see the "Methods" section for details. The dataset reflects a simple scenario that allows us a clear view on basic adjustments of our workflow. Following our workflow (see Fig. 1), we first pre-corrected the TGS reads using one of the DBG-based HEC methods LoRDEC [13], FMLRC [14], and Ratatosk [15]. Figure 3a, c displays the error content of the TGS reads that remains after iterative application of one of the three DBG-based HEC methods, where the maximum number of iterations was 8. FMLRC, LoRDEC, and Ratatosk all have in common that results markedly improve during the first 3 iterations and stabilize thereafter; no significant further improvements can be observed after the third iteration. The only exception to that rule is Ratatosk, which exhibited further, albeit only slight, improvements beyond the third iteration with respect to indel error content; note, however, that Ratatosk was outperformed by the other two methods with respect to indel error content, opposite to the results achieved on mismatch content, where Ratatosk outperforms the other two tools. Based on monitoring performance relative to the number of iterations, and on comparing the DBG-based tools with each other, we decided to invoke HERO's MA/OG stage after exactly 3 DBG-based iterations, regardless of which DBG-based method was employed. Based on the results achieved here, we suggest using 3 iterations of a DBG-based HEC method prior to applying the MA/OG stage of HERO in general, and we stick to that scheme of iterations in the following.
Fig. 4 Indel error rates (y-axis) for PacBio CLR and ONT reads after correction with different protocols for different datasets (x-axis). Ratatosk, FMLRC, and LoRDEC refer to 8 iterations of the respective methods (pointed out as optimal protocol earlier); R-HERO, F-HERO, and L-HERO refer to 3 iterations of Ratatosk, FMLRC, and LoRDEC, respectively, followed by 5 iterations of HERO-OG. a The simulated PacBio CLR reads. b Both PacBio CLR and ONT reads from the 4 real datasets
As shown in Fig. 3b, we make use of the MA/OG stage of HERO only after threefold application of DBG-based HEC tools. In that, we iterate application of the MA/OG stage of HERO as well and monitor the resulting performance rates in terms of indel and mismatch content of TGS reads. As one can see in Fig. 3b, the indel error content drops rather dramatically already after the first iteration of the OG stage of HERO (note that one iteration, by definition, consists of 3 repetitions of steps 1–6 of the workflow of HERO; see Fig. 1). Beyond the first iteration of HERO’s OG stage, results keep improving, but only to minor (likely negligible) amounts, regardless of which DBG-based tool was used beforehand.
See Fig. 3d: when considering mismatches, one also observes a marked drop already after the first iteration of HERO's OG stage. Beyond the first iteration, further improvements appear to be negligible, and in any case minor with respect to the improvements achieved at the beginning.
Given that indels are the predominant type of errors that affect TGS reads and given that HERO induces a fairly drastic reduction of the indel error content of TGS reads, HERO arguably makes a substantial contribution to HEC of TGS reads.
As supported by these basic experiments, we henceforth decided to run 3 DBG-based pre-correction cycles (indel removal), followed by 5 OG-based correction cycles (haplotype-aware identification of errors). To juxtapose usage of MAs/OGs with using DBGs alone, we sometimes run 5 further DBG cycles instead of 5 MA/OG cycles, because the equal amount of iterations makes for an optimally fair comparison of alternative protocols.
For further, compact summaries of results on the basic 3 Salmonella dataset, see also the right set of bars in Fig. 4 and the uppermost block of rows in Table 1. One can see that applying LoRDEC 3 times as the DBG-based pre-correction stage, followed by 5 applications of the MA/OG stage of HERO, reduces the indel error content of the PacBio CLR reads from 6053 mistaken gap positions per 100 kbp to only 11.58, that is, HERO as per its optimal protocol reduces indel errors by a factor of roughly 500. Note that applying LoRDEC, Ratatosk, or FMLRC alone leaves one with 5.5 times more indels at best (8 iterations of LoRDEC). The advantages of HERO over DBG-only protocols are obvious across all the different state-of-the-art approaches LoRDEC, Ratatosk, and FMLRC. HERO achieves ten times better indel error rates than Ratatosk alone (11.77 versus 110.75 per 100 kbp) and 5 times better indel error rates than FMLRC alone (17.47 versus 83.26); for LoRDEC, see above (5.5 times better error rates).
Fig. 5 Indel error rates (y-axis) versus coverage (x-axis) for the different method protocols. Ratatosk, FMLRC, and LoRDEC refer to 8 iterations of the respective methods (pointed out as optimal protocol earlier); R-HERO, F-HERO, and L-HERO refer to 3 iterations of Ratatosk, FMLRC, and LoRDEC, respectively, followed by 5 iterations of HERO-OG (again pointed out as optimal protocol earlier). The figure shows that the advantages of HERO become more striking with decreasing coverage
Table 1 Benchmark results for correcting simulated PacBio CLR reads. Indels/100 kbp: average number of insertion or deletion errors per 100,000 aligned bases. Mismatches/100 kbp: average number of mismatch errors per 100,000 aligned bases. Genome fraction (GF) reflects how much of each of the strain-specific genomes is covered by the corrected reads. N/100 kbp denotes the average number of uncalled bases (Ns) per 100,000 bases in the read. MC = fraction of misassembled contigs. R-HERO, F-HERO, and L-HERO refer to 3 iterations of Ratatosk, FMLRC, and LoRDEC, respectively, followed by 5 iterations of HERO-OG

Results for mismatch rates are not as striking as for indel error rates. R-HERO yields 93.9 mismatch errors per 100 kbp, an improvement of more than 25% over Ratatosk alone (128.23 per 100 kbp). All other conceivable protocols leave at least 180 mismatches per 100 kbp in the TGS reads.
20 bacterial strains
This dataset consists of 20 bacterial strains, stemming from 10 species (so, on average, each species has two strains), at an average coverage of 10X per strain for both Illumina (NGS) and PacBio (TGS) reads; see the "Methods" section for details. The dataset is supposed to provide a scenario of medium difficulty. See again Fig. 4 and Table 1. Application of R-HERO improves the indel error rate (26.54) by a factor of approximately 9 over the best DBG-only protocol (LoRDEC: 229.92), which again is a rather drastic improvement. As for mismatches, R-HERO improves the mismatch error rate (103.51) by 40% with respect to the best DBG-only protocol (Ratatosk: 171.31), which is still quite remarkable.
100 bacterial strains
This dataset consists of 100 strains from 30 species, at an average coverage of 10X per strain for both NGS (Illumina) and TGS (PacBio CLR) reads. The dataset is supposed to reflect a more complex scenario. The idea is to evaluate which of the possible protocols and methods potentially become confused by the complexity of the data. In fact (see again Fig. 4 and Table 1), the error rates indeed increase with respect to the two easier datasets just discussed. However, the optimal indel error rate achieved by R-HERO (46.10) stays approximately 9 times better than the best result achieved by DBG-only protocols (LoRDEC: 392.21), corroborating the clear advantages of HERO over the state-of-the-art. As for mismatches, R-HERO achieves only slight advantages (250.16) over prior approaches (Ratatosk: 267.42).
In summary, the advantages of HERO over prior approaches/protocols are most evident for indel error rates, with rates improving by a factor of 5 or even more. Advantages for mismatch rates are still clearly evident but are not as striking as for indels. In this vein, it may be important to remember that artificial indels are the predominant error affecting TGS reads.
As for the influence of numbers of species and strains on the results, HERO preserves its advantages across the different levels of complexity in comparison with prior methods (about 9 times less indels regardless of the level of complexity). However, from an absolute point of view, the error rate increases on increasing complexity: from 11.58 for low via 26.54 for medium to 46.10 for more difficult datasets. In contrast, the error rates increase from 59.0 via 229.92 to 392.21 for the prior DBG-only methods. To complete the picture, we recall that the initial indel error rate of the raw reads was greater than 5000 for all datasets.
Strain-mixing spike-in datasets: the influence of coverage
To investigate the effect of the sequencing coverage of the NGS reads on error correction, we considered 6 "strain-mixing spike-in" datasets. These datasets mix simulated reads from 10 known Salmonella strains into real sequencing data; errors are then corrected in the simulated reads. Note that we know the ground truth only for the simulated reads, while the majority of reads is real, which is the idea of such datasets. The 6 datasets varied in terms of the coverage of the "spiked-in" NGS reads, which ranged from 5X to 30X in steps of 5X across the 6 datasets. The average coverage of PacBio CLR reads for the 10 Salmonella strains in the 6 datasets amounted to 10X.
See Fig. 5 for results with respect to indel and mismatch error rates, and see also Additional file 1: Table S1 for full details. If the coverage is below 15X, the reduction of the indel error rate of HERO is particularly noticeable. Results are particularly striking for a coverage of only 5X of the NGS (Illumina) reads. Here, application of L-HERO yields a 6-fold improvement over the best DBG-only method (LoRDEC): 39.09 by L-HERO versus 246.33 by LoRDEC.
Fig. 6 Indel error rates for different methods stratified by the strains of the real datasets (1st column: NWC = Genbank ID, Bmock12 = IMG Taxon ID), which are ordered in descending order by their coverages (2nd column). Error rates are colored from large (red) via medium (white) to low (blue). The evident trend is that indel error rates improve with increasing coverage
With increasing NGS read coverage, all methods, DBG-only and HERO, improve, and differences become less striking than at low coverage. However, even if the coverage of the NGS reads is large, HERO still induces substantial improvements. At 30X, L-HERO reduces the indel error rate of LoRDEC by 24.49% (LoRDEC: 9.8 versus L-HERO: 7.4). Note that the mismatch error rate of FMLRC is the best among all DBG-only approaches: 107.8 versus 208.3 by LoRDEC, at a coverage of 30X. Although the mismatch error rate of F-HERO is close to that of FMLRC (F-HERO: 102.87 versus FMLRC: 107.8), F-HERO reduces indel errors by up to 37.93% over FMLRC alone (F-HERO: 12.11 versus FMLRC: 19.51).
Experiments: real metagenomic datasets
We further evaluated all approaches on the 4 real datasets "Bmock12 PacBio," "Bmock12 ONT," "NWC PacBio," and "NWC ONT."
Bmock12 ONT
The "Bmock12" dataset contains 11 strains from 9 species and is of rather low complexity. The two species that exhibit more than 1 strain are Marinobacter and Halomonas.
See Table 2 for the following results. All methods achieve relatively low error rates. As supported by our experiments on the spike-in datasets with respect to coverage, the most plausible explanation is that the NGS read coverage was relatively large (15X to 618X), in combination with the fairly low complexity of the dataset. Still, application of HERO exhibited substantial improvements over the results that DBG-only methods achieved. For example, HERO reduced the indel error rate by on average 48% (27–79%) in direct juxtaposition (e.g., FMLRC versus F-HERO) with the DBG-only protocols. L-HERO outperformed the other approaches in particular: its indel error rate is 37.9% lower than that of LoRDEC alone (5.03 by L-HERO versus 8.10 by LoRDEC). As for mismatch errors, L-HERO reduces errors by only 9% in comparison with LoRDEC: 20.54 by L-HERO versus 22.56 by LoRDEC.
Table 2 Benchmark results for correcting reads from real datasets. Indels/100 kbp: average number of insertion or deletion errors per 100,000 aligned bases. Mismatches/100 kbp: average number of mismatch errors per 100,000 aligned bases. Genome fraction (GF) reflects how much of each of the strain-specific genomes is covered by the corrected reads. N/100 kbp denotes the average number of uncalled bases (Ns) per 100,000 bases in the read. MC = fraction of misassembled contigs

Bmock12 PacBio
For maximal comparability, we used the same Illumina reads as for ONT to also correct the PacBio CLR reads. We found that the error rate of the uncorrected TGS reads affected the quality of the correction. Note that the indel error rate of the Bmock12 PacBio CLR data is nearly twice that of the ONT reads (4001.58 for ONT versus 7790.78 for PacBio CLR). Although we used the same Illumina reads to correct the PacBio CLR reads, more indel errors were retained in comparison with the ONT reads. Apart from that difference, results are largely analogous to the results obtained for ONT reads. In particular, after correction with Ratatosk, the PacBio CLR reads still exhibited an indel error rate of 335.23 per 100 kbp, which is 10 times larger than that of Ratatosk on ONT reads (31.17). As for levels of improvement of HERO over DBG-only methods, results scale similarly in a relative comparison. For example, L-HERO achieves an indel error rate of 18.95 versus 40.99 by LoRDEC, so it reduces the errors by more than 50% over the DBG-only approach. We recall that for ONT reads, HERO achieved a relative improvement of 37.9% (see above: 5.03 versus 8.10).
NWC
This dataset contains 3 species (Streptococcus thermophilus, Lactobacillus delbrueckii, Lactobacillus helveticus) of 2 strains each, of ANI 99.99%, 99.24%, and 98.03%, respectively. Because of the high ANI, this dataset is particularly challenging in terms of strain diversity; see Additional file 1: Table S3 for additional information. Although we removed low-quality bases (retaining only bases above Q20), the NGS (Illumina) reads retained an unusually high error rate (indel error rate: 19.58/100 kbp; mismatch error rate: 883.28). This does not reflect standard scenarios and can have a considerable impact on the quality of the correction. To re-establish a standard scenario, we corrected the NGS reads prior to hybrid correction of the TGS reads using bfc [40], an established error corrector for Illumina short reads.
NWC ONT
Here, we discuss the results from using ONT reads as TGS reads, the coverage of which was 94.57X. As one can see in Fig. 4b and Table 2, the correction results for the NWC datasets were worse than for Bmock12 and the simulated datasets. The obvious explanation is the difficulty of this dataset in terms of the similarity of the strains (as expressed by the large ANIs, see above). Moreover, the initial error rate of the raw ONT reads was substantially larger than for Bmock12.
Still, HERO reduced the indel error rate by more than 50% (a factor of 2) in direct comparison with the toughest competitor (L-HERO: 72.05; LoRDEC: 154.4), which means a remarkable reduction nevertheless. In particular, this means that the indel error rate drops to less than 1% of that of the raw reads (7991.02). As for mismatch correction, F-HERO outperformed the other methods, reducing it by more than 20% in comparison with the best DBG-only protocol (F-HERO: 270.36; FMLRC: 340.97). It also means that mismatch errors drop to approximately 5% of the original value (5476.30).
NWC PacBio
See Table 2: L-HERO outperformed all other approaches in terms of indel error correction, dropping the error rate by 31.15% in direct comparison with the best DBG-only approach (L-HERO: 54.68 versus LoRDEC: 79.42). In terms of correcting mismatches, F-HERO outperforms the other approaches. In direct comparison with FMLRC, as the toughest DBG-only competitor, F-HERO achieves a reduction of 28.52% (F-HERO: 171.05; FMLRC: 239.31).
NGS read coverage
To evaluate the effect of NGS (here: Illumina) read coverage, we analyzed how performance on the 4 real datasets varied with that coverage; see Fig. 6 and Additional file 1: Fig. S3. Note that the NGS read coverage of the 6 strains in the NWC dataset is not even but ranges from 10.27X to 56.29X. We therefore stratified correction results by the 6 individual strains and related the correction rates to the strains' individual coverages. We applied the same stratification analysis to the 11 strains that make up Bmock12, whose coverages range from 14.91X to 618.76X.
Fig. 2 Steps 1 and 2 of the workflow displayed in Fig. 1. Step 1, "Short reads align long reads," leads to alignments of the long read to be corrected with short reads. Given the alignments, putative variant positions P1:X, P2:X, and P3:X are identified ("X" indicates mismatch). Step 2, "Phase and filter out reads," inspects the sequence content of both the long and the short reads at each of these positions. Different scenarios are conceivable. In P1:X, all short reads agree on "A," so no short reads are removed because of this position. In P2:X, there are large numbers of short reads supporting both "A," agreeing with the long read, and "C," disagreeing with the long read. Because the number of short reads in disagreement with the long read is too large, these short reads are removed ("filtered out"); most likely, they stem from a different haplotype. We now refer to this position as P2:M with "M" meaning "match," indicating that there is nothing further to be done. In P3:X, all but one short read support "G." Because the number of short reads (exactly one) that do not support "G" is too small, these short reads are kept; most likely, they contain a sequencing error. After these two steps, P1:X and P3:X require further attention, while P2:M has been resolved
The first and obvious observation is that the error rates correlate negatively with NGS read coverage: the larger the coverage, the lower the error rate. The second observation is that the general trends with respect to differences between the datasets Bmock12 and NWC persist in this analysis: results are generally better for Bmock12, owing to the reduced level of difficulty in terms of the similarity of the strains in Bmock12.
As for the negative correlation between error rates and coverage, see Fig. 6, where F-HERO serves as an exemplary protocol: the lower the coverage, the higher the error rates, as in NWC, where each strain is partnered by another one from the same species. This correlation is disturbed for Bmock12, because the majority of the 11 strains has no counterpart, which renders error correction substantially easier for them; this, in turn, has a clear influence on the error rates. Still, as a look at the exemplary F-HERO shows, the negative correlation is established there as well (the correlation also exists for R-HERO and L-HERO, albeit with some fluctuations).
As for examples of relations of coverage and error rates, note that in the Bmock12 dataset, the indel error rate drops substantially as soon as coverage rates exceed 30X for the NGS reads. This effect can be observed across both datasets, and all methods, both HERO and DBG-only type.
Improving genome assembly
The arguably canonical area of application that decisively depends on sequencing reads that are both long and sufficiently free of errors is de novo genome assembly. It also stands out insofar as reads are its only input. We therefore pay particular attention to the effects of our hybrid-hybrid error correction framework on the assemblies that one obtains. For maximum comparability, we concentrated on Canu [20] in particular as an assembly tool that (1) reflects the state-of-the-art as a very popular current tool and (2) offers an option to compute assemblies from PacBio HiFi reads ("HiCanu") [41]. As for (2), our results demonstrated that the error rates of our corrected TGS reads are substantially lower than those of PacBio HiFi reads; as a side remark, note that this establishes a major advantage over PacBio HiFi reads themselves, as the length of PacBio CLR and ONT reads is on average two and four times greater, respectively. This justifies running HiCanu on our corrected reads. Subsequently, one can compare the assemblies computed from raw reads using Canu itself (which integrates an error correction module in its own right) with the assemblies computed from the corrected reads using HiCanu. Furthermore, we considered Hifiasm-meta [24], which cannot process the uncorrected reads and therefore does not allow comparing assemblies before and after error correction. We considered Hifiasm-meta mostly to demonstrate that the improvements we claim to make do not depend on a particular assembler.
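To make the comparison concrete, a hedged sketch of the two assembly runs follows (Canu on the raw reads, HiCanu via Canu's -pacbio-hifi mode on the HERO-corrected reads); all paths, prefixes, and the genome size are placeholders, and the actual parameterization used in our experiments may differ.

```python
import subprocess

# Assembly of the raw PacBio CLR reads with Canu (placeholder arguments)
subprocess.run(["canu", "-p", "raw", "-d", "asm_raw",
                "genomeSize=50m", "-pacbio", "raw_reads.fastq"], check=True)

# Assembly of the HERO-corrected reads with HiCanu, i.e., Canu's HiFi mode,
# justified by the HiFi-like error profile of the corrected reads
subprocess.run(["canu", "-p", "hero", "-d", "asm_hero",
                "genomeSize=50m", "-pacbio-hifi", "hero_corrected.fastq"],
               check=True)
```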
To highlight the particular advantages of our hybrid-hybrid error correction protocol, we further focused on scenarios that exhibit multiple, closely related genomes. Therefore, we considered the 4 real datasets that we considered earlier (Bmock12-PacBio, Bmock12-ONT, NWC-PacBio, NWC-ONT).
See Table 3 for results on Canu/HiCanu and Table 4 for results on Hifiasm-meta. The first observation is that HERO induces considerable improvements of the assemblies over assemblies based on DBG-only hybrid error corrected reads, both in terms of genome fraction and error content. When running HiCanu, HERO also yields improvements in terms of the length of the contigs, summarized here by the NGA50. When running Hifiasm-meta, there are no improvements in terms of sheer length (while the considerable improvements in terms of genome fraction and error content persist, as mentioned above).
Table 3 Results on assembly after correction by different correction protocols of reads from 4 different real datasets. 1st column: genome assembly program; Canu was used for raw reads, while HiCanu was used for corrected reads (because their error profiles resemble those of PacBio HiFi reads). 2nd column: error correction protocol. Indels/100 kbp: average number of insertion or deletion errors per 100,000 aligned bases. Mismatches/100 kbp: average number of mismatch errors per 100,000 aligned bases. Genome fraction (GF) reflects how much of each of the strain-specific genomes is covered by the corrected reads. NGA50 is the length of the longest contig such that the alignments of that and all longer contigs span at least 50% of the reference sequence. N/100 kbp denotes the average number of uncalled bases (Ns) per 100,000 bases in the read. MC = fraction of misassembled contigs

Table 4 Results on assembly using Hifiasm-meta after correction by different correction protocols of PacBio CLR and ONT reads from 4 different real datasets. 1st column: genome assembly program = Hifiasm-meta. 2nd column: error correction protocol. Ratatosk, FMLRC, and LoRDEC refer to 8 iterations of the respective methods (pointed out as optimal protocol earlier); R-HERO, F-HERO, and L-HERO refer to 3 iterations of Ratatosk, FMLRC, and LoRDEC, respectively, followed by 5 iterations of HERO-OG. Indels/100 kbp: average number of insertion or deletion errors per 100,000 aligned bases. Mismatches/100 kbp: average number of mismatch errors per 100,000 aligned bases. Genome fraction (GF) reflects how much of each of the strain-specific genomes is covered by the corrected reads. N/100 kbp denotes the average number of uncalled bases (Ns) per 100,000 bases in the read. MC = fraction of misassembled contigs