Comparing methods for constructing and representing human pangenome graphs

In this article, we provide a comprehensive view of whole-genome human pangenomics through the lens of five methods that each implement a different graph data structure: Bifrost, mdbg, Minigraph, Minigraph-Cactus, and pggb. We examine several features of pangenome graphs, in particular their scalability and how they represent genetic diversity. To this end, we collected all publicly available high-quality human haplotypes and attempted to construct pangenomes of various complexity with each selected tool. Although vg has been widely used as the basis of relevant pangenome-based discoveries, for example on fast and accurate short read mapping [4], we decided to not consider it in our analysis for two main reasons: the bias introduced by the reference sequence that is used as the backbone of the graph (and associated to the vcf) together with the limited capacity of this method to integrate structural variations from many genomes. We believe both aspects are drivers of the use of pangenome graphs.

Scalability and characteristics of pangenome graph construction tools

We ran the above five tools on three datasets consisting of 2, 10, and 104 human haplotypes, respectively (Table 3). We compared the computational performance of construction algorithms as well as characteristics of the produced pangenome graphs. The goal is to assess the ability of each method to scale to data available in the near future, i.e., thousands or even millions of human genomes [5].

The performance of each tool is evaluated in terms of running time, peak memory, disk space required by the output data structure (graph and annotations). We also compared the number of nodes, edges and connected components as indicators of the complexity of the graph. Results are displayed in Table 1.

In terms of running time, mdbg is two orders of magnitude faster than other tools on all considered datasets, taking around two minutes on the H2 dataset and half an hour on H104. Bifrost is the second fastest on H104 (18 hours), and Minigraph is the second fastest on H2 (8 minutes). Minigraph-Cactus takes one order of magnitude more time than Minigraph. We could not obtain graphs for pggb and Minigraph-Cactus on H104 as for the first the execution did not finish after 2 weeks and the second returned an error.

In terms of memory usage, mdbg consistently uses less than half the memory of other tools (31 GB on H104), followed by Minigraph (61 GB on H104). On H2 all tools used between 8 and 66 GB of memory.

All tools used reasonable disk space to store the resulting graph, \(\le 12\) GB for H10 and \(\le 38\) GB for H104. Although Minigraph-Cactus and pggb retain all variations and are the only two tools able to reconstruct the input haplotypes directly from the graph, they are the second and third most efficient in term of disk space (for Minigraph-Cactus, 3.6 GB on H2 and 7 GB on H10). While Bifrost and Minigraph perform all computation in memory, pggb, Minigraph-Cactus, and mdbg store intermediate files on disk, taking comparable space to the input size (up to 3× for Minigraph-Cactus).

Different tools yield different pangenome graphs topologies

Graph metrics such as the number of nodes, edges and connected components provide useful insights on the level of detail of the represented variations and on the complexity and accessibility of the information inside the pangenome.

The number of graph nodes varies between 17,000 and 11 millions for the H2 dataset across all tools. In all cases, the number of nodes is at least 3 orders of magnitude smaller than the number of bases in the haplotypes, indicating that pangenome graphs are effective at compressing linear parts of the haplotypes. Tools which discard variations (Minigraph and mdbg) yield in the order of \(10^4\)–\(10^5\) nodes across all datasets, while tools which retain all variation (Bifrost, Minigraph-Cactus and pggb) yield in the order of \(10^6\)–\(10^7\) nodes. In all cases going from the H10 dataset to the H104 dataset increases the number of nodes by 5x, indicating that graph complexity grows sublinearly with the number of added haplotypes.

The number of connected components varies between 2 and 1402 across all methods and datasets, and the number of large components (i.e., those with more than 1% of total base pairs) varies between 1 and 30. If chromosomes were separated perfectly, pangenome graphs should contain exactly 24 connected components (one per nuclear chromosome, excluding mitochondria). Minigraph produces 24 large connected components as the number of chromosomes in the reference CHM13 v2.0 (25 including mitochondria). Bifrost and Minigraph-Cactus yield graphs with less than 25 connected components while mdbg and pggb have more than 25. In the Bifrost dBG, the vast majority of sequences (> 99.99%) are in a single giant component, as chromosomes are joined because they share common k-mers. In mdbg, such joining does not occur on dataset H2, which has 24 large enough components (each containing > 1% of bases), possibly due to the absence of long and similar enough regions between chromosomes. Minigraph does not map any mitochondrial sequence from the input haplotypes to the reference, while they do get included into Minigraph-Cactus graphs.

Even if it is common practice to analyze pangenomes chromosome by chromosome [8, 17], in this analysis we purposely used entire genomes as input instead. This was done for two reasons: i) to highlight the scalability of the tools, and ii) because separating chromosomes prevents the identification of inter-chromosomal inversions, translocations, and transposable elements, even if most of the generated inter-chromosomal events are probably alignment artifacts. The effects of this choice can be seen in the pggb and the Minigraph-Cactus H10 variation graphs of Fig. 1. In the pggb graph 19 chromosomes are linked into a single giant component, while chromosomes 17, 18, 20, X, and Y are in other large components. This giant component consists of 25 M nodes that contain 83% of the total basepairs. The remaining 859 components represent only 4.7% of the total bases due to small sequences in the input haplotypes. In the Minigraph-Cactus graph all chromosomes are linked into a single giant component except chromosome 18 that is in a separate component, and the sexual chromosomes (X and Y) that are connected together into another component.

Fig. 1figure 1

The complete pangenome construction scheme and visualization. A The overall workflow, using 5 different tools on 3 different datasets; B complete 104 haplotypes variation graph built by Minigraph; C focus on part of HLA (MHC) region in chromosome 6 from panel B; D focus on DRB1-5 locus of HLA from panel C; E, complete 10 haplotypes variation graph built with pggb; F 10 haplotypes variation graph built with Minigraph-Cactus; G 104 haplotypes pangenome mdbg; H 10 haplotypes Bifrost dBG. All graphs except those produced by Minigraph have been simplified using gfatools and rendered using Bandage. VG is for variation graph

Table 1 Time, memory, final disk space, nodes, edges, total connected components and connected components with more than 1% of base pairs comparison of Bifrost, mdbg, pggb, Minigraph and Minigraph-Cactus for different number of haplotypes in input. Minigraph-Cactus times include the Minigraph graph construction step. pggb was not able to complete its execution on the largest dataset in more than 2 weeks thus it is not considered. Minigraph-Cactus failed to compute the 104 HAP datasetInterpretation of variation in pangenome graphs: focus on two HLA loci

The ability to detect and annotate variations among input haplotypes defines the scope of each pangenome graph construction method. Previous work [18] recommends to build graphs on a specific loci rather than the entire genome for the purpose of i) identifying genomic diversity and ii) mapping raw reads to divergent regions, specifically difficult-to-map repeats. Here we evaluate how pangenomes built from entire haplotypes represent specific biologically relevant loci.

Extraction of HLA-E and a complex HLA region from complete pangenome graphs

We extracted from complete pangenomes the regions corresponding to two loci of the Human Leukocyte Antigen complex, also known as HLA. These regions are highly medically relevant as they contain many disease-associated variants [19]. The first locus is the HLA-E gene, that is part of the nonclassical class I region genes, spanning 4.8 kbp, and is relatively conserved across populations. It has been shown to have a significant association with hospitalization and ICU admission as a result of COVID-19 infection [20]. The second is an HLA complex region comprising the HLA-A gene, part of the classical, highly polymorphic class I region. It is around 58 kbp long and contains the HLA-U, HLA-K, HLA-H, and HCG4B genes. We extracted these two regions from pangenome graphs using a custom script that yields a subgraph corresponding to a given set of sequences and their variation. The script uses a different recommended method for each of the pangenome graph types. In a nutshell, we extracted regions using exact coordinates when possible and resorted to sequence-to-graph alignment otherwise (see AppendixLoci extraction method” section for details). While on variation graphs and mDBGs nearby nodes of an aligned region correspond to variations of the locus, this is not always true for standard dBGs. Extracting accurate and complete loci representation is an unsolved challenge for dBGs.

HLA-E: a low complexity region

Figure 2 shows how the different tools represent HLA-E over datasets H2, H10, and H104. As expected, Minigraph does not detect any variation, since the SNPs that characterize the region are too small to be considered in the construction steps of their algorithm. pggb, on the contrary, has 2 SNPs in H2 and 3 in H10. Bifrost detects the same SNPs as pggb in H2 and H10. Both of them represent the exact same variations and render the same haplotype paths. mdbg captures the heterozygosity of a large region containing the HLA-E locus as the number of samples grows. As the mdbg graph is built in minimizer space, nodes represent long genomic segments (in the order of hundreds of thousand of base-pairs). In H10 and H104, the minimizer-space representations of the haplotypes are identical; however, differences in flanking regions of the graph create variations that are captured in extra nodes that are also extracted in this region. On H2, Minigraph-Cactus detects 3 variations as the dataset used is different, containing the CHM13 reference and just one haplotype of HG006 (as in Minigraph), as discussed in the “Datasets and haplotypes collection” section.

Fig. 2figure 2

Representations of the HLA-E locus by five graph construction methods over three increasing large human pangenomes. Nodes highlighted in red contain part of the locus sequence. The number of nodes and edges displayed below each graph concerns the whole subgraph (both red and gray nodes). Minigraph, on H2, H10, and H104, and mdbg, on H2, have only a portion of one node highlighted since the 4.8 kbp region is contained inside a single, long node

Figure 2 also illustrates how pangenome complexity grows with the number of genomes: the Bifrost H104 subgraph has the most variation across all methods, highlighting that dBGs represent variations exhaustively in large graphs. On the other hand, pggb has the most straightforward method for extracting subgraphs, and also represents variants exhaustively in datasets H2 and H10, but could not scale to the H104 dataset.

HLA complex locus: high complexity region

Figure 3 is the counterpart of Fig. 2 for the complex locus part. In this case, the overall interpretability of the region is more challenging, as the number and the structure of the variations is different than in HLA-E. It is also more difficult to compare across tools. Base-level variations, e.g., SNPs, are not visually recognizable in Fig. 3 in the methods that retain them (i.e., pggb, Minigraph-Cactus, and Bifrost) due to the large sizes of graphs.

Fig. 3figure 3

Representations of the complex HLA region by five graph construction methods over three increasing large human pangenomes. See caption of Fig. 2 for details

There are notable differences in how tools represent the variation, which is well-illustrated in the H2 dataset. While Minigraph renders H2 as a single sequence plus a large structural variant (SV) of \(\approx\) 52 kbp, pggb separates it into two paths that differ by \(\approx\) 54 kbp in length. Bifrost represents a detailed bubble that contains many variations inside each path. In mdbg, even extracting the complete locus is a challenge as many of the subgraph nodes were not selected by our procedure. Minigraph-Cactus adds base level divergences between haplotypes on top of Minigraph SV graph.

These differences between representations are further accentuated in the H10 dataset. For it, pggb tends to separate the haplotypes into different paths, Bifrost renders consistently the same compacted representation and Minigraph neglects most of the small differences but is able to display accurately the general picture, and Minigraph-Cactus, as in H2, adds small variations on top of Minigraph structure.

Uncovering characteristics of graphical pangenome tools

The data structures generated by pangenome building tools are expected to facilitate comparisons between the input genomes. In addition, pangenome graphs should be stored in such a way to be easily used by downstream applications. We identify 8 important features for pangenome graph construction tools: (i) stability, (ii) editability, (iii) accessibility by downstream applications, (iv) haplotype compression performance, (v) ease of visualization, (vi) quality of metadata and annotation. Two other but important features, scalability, and interpretability of produced graphs, were already discussed in the “Scalability and characteristics of pangenome graph construction tools” and “Interpretation of variation in pangenome graphs: focus on two HLA loci” sections. Table 2 summarizes some of the following considerations on the relative strength of the tools.

Editability and dynamic updates

As more high quality assemblies will be generated in the near future, haplotypes may be added to a pangenome, or replaced by improved versions. Updating an existing data structure instead of rebuilding it from scratch is both computationally and energetically efficient. However, many succinct data structures currently used in pangenome representation are static, i.e., cannot be updated. Some methods allow a restricted set of editing operations. Minigraph allows to add new haplotypes on top of an already built graph. Bifrost provides C++ APIs to add or remove (sub-)sequences, k-mers and colors from the ccdBG. pggb, using odgi [21], allows specific operations that delete and modify nodes and edges and add and modify paths through the graph. As Minigraph-Cactus can be opened with odgi, it supports the same operations as pggb. The current mdbg implementation uses a dynamic hash table, but does not expose an interface that supports updates.

Stability

Counter-intuitively, a pangenome graph construction tool may in some cases generate different outputs when executed multiple times with the same haplotypes as input. This unstability could be due to a permutation in the order of the sequences given as input, or non-determinism in the construction algorithm. Yet in order to facilitate the reproducibility of results, pangenome building tools should generate an unchanged output from the same set of input sequences, independently of the particular run or the order in which these are given. We performed two tests to evaluate tool stability: (i) we run the tools 3 times using as input the same H10 dataset and ii) we run the tools twice on shuffled input sequences, i.e., changing the order of the haplotypes of H10.

Bifrost and mdbg constructed exactly the same pangenome on every test, as by definition, de Bruijn graphs are stable. Minigraph generates identical graphs on identical inputs, but generates slightly different graphs when the input is permuted. Indeed the construction algorithm of Minigraph is order-sensitive as it augments the existing graph structure by aligning the next given haplotype to it and adding divergent sequences. Minigraph-Cactus generates slightly different graphs on identical input. pggb generated slightly different graphs while maintaining the same haplotype sequences in the paths. The overall representation of the input genomes is therefore preserved, while the topology of the variation graph varies. The first two of the three phases of the pggb pipeline (all-vs-all alignment and graph imputation) produce the same result on different runs with the same input but differences arise when the order of the input haplotypes changes. Most of the differences in the graph topology are thus due to the final smoothing steps.

Accessibility by downstream applications

To facilitate their adoption, pangenome representations should be easily processed by downstream analyses. De Bruijn graphs are challenging to analyze due to their high number of nodes, edges, and redundancy (the \(k-1\)-overlaps between nodes). Though De Bruijn graph representations usually support queries of presence/absence on nodes (as in Bifrost), they lack tools able to perform more elaborate analyses such as those discussed in the “Interpretation of variation in pangenome graphs: focus on two HLA loci” section, e.g., incorporating haplotype information at the k-mer level. On the other hand, variations graphs with paths provide more flexibility, i.e., as in pggb and Minigraph-Cactus with the odgi visualization toolkit. Finally, in Minigraph, which considers a narrower spectrum of variants, the absence of path information prevents haplotype-level analysis; haplotypes would need to be manually mapped back to the graph. The choice of the pangenome building tool depends on the envisioned application. pggb and Minigraph-Cactus graphs have been shown to outperform linear references for short-read mapping, genotyping, and RNA sequencing mapping [8]. As these two methods are complex pipelines based on multiple tools where parameters have been carefully set, they can be more challenging to install and run than single integrated tools. Minigraph alone can also be used if the focus is on structural variation instead of SNPs or small indels, and to quickly produce a pangenome graph for complex loci visualization and interpretation. The dBG-based approaches show that, for example with Bifrost, they retain the same base-level information as more computational-heavy variation graph approaches, but the lack of tools to use them for analysis limits their adoption.

Haplotype compression

Building a graph pangenome can be seen also as a way to store, compact and retrieve the input haplotypes. As the number of new assemblies is growing faster than the data storing capacity, pangenomes can potentially help save storage space. This is highlighted by the disk space reported in Table 1, which is consistently smaller than the sum of haplotype sizes for all methods and datasets.

In order to losslessly retrieve the input genomes from a pangenome, the representation has to store all variations from the original haplotype sequences as paths in the graph. pggb and Minigraph-Cactus fall into this category while the other three considered tools do not store paths, or do not consider all variations, thus they are lossy.

Of note, the GBZ tool [9] enables graph pangenomes that store paths in the GFA file format to be stored in a lossless compressed form. It uses a Graph Burrows-Wheeler transformation to compress the graph in a more efficient way than using gzip [9]. Using GBZ, the pangenomes generated by pggb and Minigraph-Cactus are losslessly compressed with space gains of 3.5–5×.

Ease of visualization

Visualizing large graphs which exceed hundreds of thousands of node is a challenge that exceeds the scope of pangenomics. The H104 pangenomes are difficult to visualize. Among the visualization tools considered by the Human Pangenome Reference consortium [6], only Bandage is able to display the Minigraph or mdbg H104 graphs, which contains a few million nodes. We reduced the number of nodes and edges of pggb, Minigraph-Cactus and Bifrost H10 graphs by collapsing isolated subgraphs representing SNPs or indels up to 10k bp (using the command gfatools asm -b 10000 -u).

Quality of metadata and annotation

Augmenting pangenome structures with information from other omics data would increase pangenome relevance in biological discoveries. As biobanks are rapidly growing, more data is available on regulatory regions, transcriptomics, CNVs and other medically relevant traits [22, 23]. Pangenome data structures could leverage such information and some of the considered tools offer basic functionality in this sense. Bifrost provides a function to link data to graph vertices through C++ APIs. pggb and Minigraph-Cactus, using odgi, offer annotation capabilities through insertion of paths or BED records. Minigraph and mdbg do not offer any annotation feature. Specifically, in order to enhance a pangenome graph with metadata (for example with genes and regulatory regions known variants), it is desirable to maintain compatibility with methods and data formats that use a linear reference. One could conceivably project data from a graph to a reference genome to continue downstream analyses using linear coordinates. A simple method to achieve this compatibility, in our view, is to store the reference genome of interest inside the graph pangenome that supports retrieving such a reference. Variation graphs built using pggb or Minigraph-Cactus, due to their locally acyclical and directed construction and their use of haplotype paths, store all the coordinates needed for such a task. Haplotype paths play an important role as they avoid additional mapping to the graph, by using the odgi tool to extract or inject the required information. Minigraph does not store haplotype paths and requires mapping sequences to the graph to restore haplotype information. On the other hand, De Bruijn graphs, using associated color data, can record the membership of k-mers to a reference sequence, yet one cannot fully reconstruct a haplotype unless k-mers positions are also stored.

Table 2 Relative strengths of five pangenome graph construction tools. Explanation of rows: (1) efficacy of construction algorithm, measuring wall-clock time; (2) degree to which variants (e.g., SNPs) are retained; (3) ability of a tool to perform well on large datasets, both in comparison to other tools but also compared to smaller datasets; (4) ability to modify the produced data structure to add or remove haplotypes; (5) property of producing the same result irrespective of perturbations, such as permutation of the input order, and repeating the same run; (6) existence of tools (and operations) that can be applied to the resulting graphs; (7) whether input haplotypes information is retained by the tools, and if so, its space efficiency; (8) whether the entire graph can be directly visualized and interpreted; (9) easiness of “zooming in” a specific genomic region and interpret variants; (10) summarizes the functionalities provided by the tools to annotate the pangenomes with genomic data; (11) ability to share information between the graph and a linear reference

留言 (0)

沒有登入
gif