YClon: Ultrafast clustering of B cell clones from high-throughput immunoglobulin repertoire sequencing data

The use of next-generation sequencing in Adaptive Immune Receptor Repertoire (AIRR-Seq) studies provides only a snapshot of antibody repertoire diversity, which comprises up to 10^18 distinct antibodies (Briney et al., 2019). Nevertheless, these techniques improve our understanding of the immune system, as they allow the study of millions to billions of distinct antibody sequences (De Witt et al., 2016; Briney et al., 2019). To interpret this vast amount of data effectively, a crucial step is grouping sequences into B cell clones. The members of a clone are derived from a common progenitor B cell (Hershberg and Luning Prak, 2015) and typically bind to the same epitope. The B cell receptor (BCR) is generated by the rearrangement of V, (D), and J gene segments, and the CDR3, located at the junction of these segments, plays a critical role in antigen recognition and binding specificity. CDR3 sequence analysis can identify B cell clones with shared antigen specificities, making it a powerful tool for characterizing the B cell repertoire. The size of a B cell clone and the number of clonotypes (each clone is counted only once) can provide insights into repertoire diversity, which is related to factors such as age (Davydov et al., 2018), response to vaccination or infection (Khavrutskii et al., 2017), allergy (Wu et al., 2014), and cancer (Zhang et al., 2019), among other conditions. However, identifying antibody sequences that belong to the same B cell clone remains a challenge (Gupta et al., 2017), particularly with such large datasets.

Some researchers assign to the same clone antibody sequences that share the same V and J genes and an identical junction region (CDR3 plus the conserved anchors at positions 104 and 118), as is the case for IMGT/HighV-QUEST (Li et al., 2013; Aouinti et al., 2015). In general, there is a consensus that antibodies must share the V and J genes to be considered part of the same clonotype. The main difference among clonotyping methods lies in the procedure and the cut-offs used to group the junction region or the CDR3 of the antibodies. Some approaches, such as those implemented in Change-O (Gupta et al., 2015; Gupta et al., 2017) and SCOPer (Nouri and Kleinstein, 2018; Nouri and Kleinstein, 2020), assign clones based on the Hamming distance between junctions and then apply a clustering method (single-linkage for Change-O and spectral clustering with an adaptive threshold for SCOPer). One problem with most of the above approaches is the time needed to process high-throughput data, which negatively impacts downstream analyses, especially in laboratories with limited resources. Lindenbaum et al., 2020 proposed a method to group Ig sequences into clonotypes that does not require gene assignments and is not restricted to a fixed junction length. It vectorizes a sequence of 150 nucleotides, covering the junction, into k-mers of size 5, which should be fast. However, it does not take into account whether the sequences are annotated as sharing the same V and J genes. Therefore, the clones assigned by Lindenbaum et al., 2020 may not fit the definition of clone adopted here.
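To make the distance-based strategy described above concrete, the following is a minimal sketch, not the implementation of Change-O, SCOPer, or any other published tool. It partitions sequences by V gene, J gene, and junction length, then links junctions within each partition by single-linkage on the normalized Hamming distance. The function names, the tuple layout of the records, and the 0.15 threshold are illustrative assumptions. A small k-mer counter is included only to show what an alignment-free representation of a sequence looks like.

```python
from collections import Counter, defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def kmer_counts(seq: str, k: int = 5) -> Counter:
    """Count the overlapping k-mers of a sequence (alignment-free view)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def assign_clones(records, threshold=0.15):
    """Single-linkage clustering within (V gene, J gene, junction length).

    records: list of (v_gene, j_gene, junction) tuples (assumed layout).
    threshold: hypothetical cut-off on the normalized Hamming distance.
    Returns one integer clone id per record, in input order.
    """
    # Only sequences sharing V, J and junction length can be compared.
    groups = defaultdict(list)
    for idx, (v, j, junction) in enumerate(records):
        groups[(v, j, len(junction))].append(idx)

    # Union-find: merging every pair under the threshold yields exactly
    # the single-linkage clusters at that threshold.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for members in groups.values():
        for pos, a in enumerate(members):
            for b in members[pos + 1:]:
                ja, jb = records[a][2], records[b][2]
                if hamming(ja, jb) / max(len(ja), 1) <= threshold:
                    parent[find(a)] = find(b)

    # Relabel cluster roots as consecutive clone ids.
    labels = {}
    return [labels.setdefault(find(i), len(labels))
            for i in range(len(records))]
```

The pairwise loop inside each partition is quadratic in the partition size, which illustrates why this family of methods becomes slow on repertoires with millions of sequences.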

The increasing availability of BCR sequencing datasets from different sources presents an enormous potential for in-depth analysis. To ensure consistency and quality in reporting these data, the Adaptive Immune Receptor Repertoire (AIRR) Community has proposed the Minimal Information about AIRR (MiAIRR) standard (Rubelt et al., 2017), which outlines the essential information that should be included when reporting these datasets. In this context, the databases OAS and iReceptor were created; as of May 17th, 2023, they held 3,448,993,669 and 815,114,273 IGH sequences, respectively (Corrie et al., 2018; Olsen et al., 2021). Notably, 77.7% of IGH sequences in OAS belong to repertoires consisting of over a million sequences, while iReceptor shows an even higher percentage, 82%. This shows the need for a more scalable tool, able to clonotype hundreds of antibody repertoire datasets, including large repertoires, quickly and efficiently while respecting the definition of clone.

For this reason, we propose YClon, a tool that identifies clonotypes on the basis of identical V and J genes and similarity in the CDR3, and that processes repertoires, including those with more than 2 million sequences, in less than one hour.
