Comparative genotyping of SARS-CoV-2 among Egyptian patients: near-full length genomic sequences versus selected spike and nucleocapsid regions

Globally, as of 25 May 2023, 15,608,522 SARS-CoV-2 genomic sequences had been submitted to GISAID. In the current work, a total of 3935 SARS-CoV-2 genomic sequences submitted by Egypt over the first 3 years of the pandemic (March 2020–March 2023) were downloaded from the GISAID database on 10 April 2023 (date of access). We aimed to highlight the shift in lineage assignment across wave patterns in Egypt.

The coronavirus genome averages approximately 29 (26–32) kb, which is identified as the largest genome size for an RNA virus [31]. Whole genome sequencing is the best option for lineage assignment; however, it has a few drawbacks, such as high cost and long turnaround time, that are limiting in low-resource countries such as Egypt. Therefore, sequencing smaller regions instead of the whole genome is considered more feasible under these circumstances.

According to the official site of the Egyptian Ministry of Health (https://www.care.gov.eg), the first pandemic wave in Egypt lasted from April 2020 to September 2020, while the second pandemic wave lasted from October 2020 to March 2021. The peaks of the first and second waves of COVID-19 in Egypt were in mid-June and late December 2020, respectively. The first wave was dominated by B variants, especially B.1, similar to other parts of the world at that time. During the following second and third waves, a shift toward the C.36 and C.36.3 lineages was observed. Then, the delta variant (B.1.617.2) and omicron variant (B.1.1.529) became dominant during the fourth and fifth waves. Interestingly, there was a recent shift in the prevalence of circulating lineages from dominant nonrecombinant forms such as B.1 and C.36.3 to recombinant forms such as XBB.1.9.1. These recombinant forms were circulating at low levels during the first year of the pandemic [32].

Several software tools have been developed specifically for SARS-CoV-2 genotyping based on whole genome and/or partial domain sequencing, such as GISAID, PANGOLIN, and Nextclade. According to the European Centre for Disease Prevention and Control (ECDC), whole genome sequencing (WGS), or at least complete or partial S region sequencing, is the best method for assigning a specific lineage or variant [33].

A study addressing the genomic diversity of SARS-CoV-2 among North African countries, including Egypt, was conducted in December 2021 [34]. The authors analyzed a total of 1669 whole genome sequences, of which 971 high-coverage sequences were from Egypt. According to the PANGOLIN tool, the most common lineages were C.36 (30.6%), followed by B.1 (25.2%), C.36.3 (7.2%), and B.1.1 and B.1.617.2 (5.1% each).

A previous Egyptian study reported a shift in lineage prevalence from B.1 to B.1.1.1 between wave 1 and wave 2 [35]. In contrast, we observed a shift from B.1 to C.36 between wave 1 and wave 2. This disagreement may be attributed to the current analysis being performed after the end of the pandemic waves. According to GISAID, the C.36 lineage was detected early during the pandemic (in May 2020) in Egypt and continued to circulate within the country at variable levels.

In this study, we aimed to evaluate the discriminatory power of each tool. All 3 tools showed comparable discriminatory power: GISAID (0.872), PANGOLIN (0.895), and Nextclade (0.866). Because the 3 software tools exhibit different nomenclature and classification systems for lineage assignment, discrepancies between tools were expected.
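The text does not state which index was used to compute discriminatory power. As an illustration only, the sketch below assumes the Hunter-Gaston discriminatory index (Simpson's index of diversity), a common choice for comparing typing methods. The function name and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def discriminatory_index(assignments):
    """Hunter-Gaston discriminatory index for a list of lineage labels.

    D = 1 - sum(n_j * (n_j - 1)) / (N * (N - 1)), where n_j is the number
    of sequences assigned to lineage j and N is the total number of
    sequences. Assumption: this is the index behind the reported values;
    the source does not name it.
    """
    n = len(assignments)
    if n < 2:
        raise ValueError("at least two sequences are required")
    counts = Counter(assignments)
    # Ordered pairs of sequences that received the same lineage label.
    same_label_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - same_label_pairs / (n * (n - 1))


# Toy labels for illustration only (not the study's data).
toy_labels = ["B.1", "B.1", "C.36", "C.36.3", "BA.2"]
print(round(discriminatory_index(toy_labels), 3))
```

Under this assumption, a value closer to 1 means that two randomly chosen sequences are more likely to receive different lineage labels from the tool.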

Here, we demonstrate one particular discrepancy that arises from the different nomenclature systems. Among 1212 sequences, AY* sublineages were detected in 184, 44, and 58 sequences according to GISAID, Nextclade, and PANGOLIN, respectively. On the other hand, the B.1.617.2 lineage (parent lineage of AY*) was detected in 26, 166, and 152 sequences according to GISAID, Nextclade, and PANGOLIN, respectively. This may be explained by the improved ability of GISAID to classify sequences into AY* sublineages rather than their parent lineage B.1.617.2. ROC/AUC analysis supported this explanation: all 3 tools showed high agreement (AUC > 85%), except for lineage B.1.617.2, for which the GISAID tool showed a poor AUC (57.5%) compared to PANGOLIN (94.7%).

Here, we conducted comparative analyses of SARS-CoV-2 genotyping based on the nucleocapsid region (28,274–29,533 in the NC_045512.2 reference genome) and the spike region (21,563–25,384 in the NC_045512.2 reference genome). Both regions were extracted from high-coverage whole genome sequences of 1212 COVID-19 patients from Egypt that had been submitted to the GISAID EpiCoV database since the start of the pandemic.
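The coordinate-based extraction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each genome has already been aligned to NC_045512.2 so that string positions match reference coordinates, and the names `REGIONS` and `extract_region` are hypothetical.

```python
# 1-based, inclusive coordinates on NC_045512.2, as given in the text.
REGIONS = {
    "S": (21563, 25384),  # spike region
    "N": (28274, 29533),  # nucleocapsid region
}


def extract_region(aligned_genome, region):
    """Return the S or N region from a reference-aligned genome string.

    Assumption: aligned_genome is in reference coordinates (gaps kept as
    '-'), so position i in the reference is index i - 1 in the string.
    """
    start, end = REGIONS[region]
    if len(aligned_genome) < end:
        raise ValueError(f"sequence too short to contain region {region}")
    # Convert 1-based inclusive coordinates to a 0-based half-open slice.
    return aligned_genome[start - 1:end]


# Placeholder sequence with the reference length (29,903 nt), for illustration.
placeholder = "A" * 29903
print(len(extract_region(placeholder, "S")))  # 3822
print(len(extract_region(placeholder, "N")))  # 1260
```

Unaligned genomes with insertions or deletions would shift these positions, so an alignment step against the reference is needed before slicing.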

In this study, we selected the Nextclade tool as the reference typing method because of its high ability to assign lineages: it assigned all but 10 sequences in the dataset (3925/3935, 99.7%) and hence could assign the majority of partial or low-coverage sequences that were left unassigned by the other tools.

Despite some discrepancies in lineage assignment between the tools, all 3 agreed on the most common lineage circulating in each wave of the pandemic in Egypt. B.1 was the most common in wave 1, C.36 in wave 2, C.36.3 in wave 3, B.1.617.2 in wave 4, BA.2 in wave 5, and BA.5.2 in wave 6, after which recombinant forms (particularly XBB.1.9.1) became predominant (Fig. 2).

We proposed that the N gene may be superior to the S gene for lineage assignment. A statistically significant difference (p = 0.04) was observed between S and N agreement with the whole genome, suggesting that the N region agrees with the whole genome more than the S region does. Despite the higher agreement of the N region (46%) with the whole genome compared to the spike region (30%), both regions are less sufficient than the whole genome, which remains the best option for lineage determination. To the best of our knowledge, this work is the first to explore the ability of a region other than the spike for rapid lineage assignment of SARS-CoV-2 sequences.
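The agreement percentages above compare lineage calls from a partial region with calls from the whole genome, sequence by sequence. A minimal sketch of that comparison is given below, assuming agreement means an exact match of lineage labels; the function name and the toy labels are hypothetical, and the statistical test behind the reported p value is not reproduced here because the text does not name it.

```python
def percent_agreement(reference_calls, region_calls):
    """Percentage of sequences whose region-based lineage call exactly
    matches the whole-genome (reference) call.

    Assumption: both lists are in the same sequence order and agreement
    is defined as an exact label match.
    """
    if len(reference_calls) != len(region_calls):
        raise ValueError("call lists must have the same length")
    if not reference_calls:
        raise ValueError("call lists must not be empty")
    matches = sum(ref == reg for ref, reg in zip(reference_calls, region_calls))
    return 100.0 * matches / len(reference_calls)


# Toy labels for illustration only (not the study's data).
whole_genome = ["B.1", "C.36", "BA.2", "C.36.3"]
n_region = ["B.1", "C.36", "B.1", "B.1"]
print(percent_agreement(whole_genome, n_region))  # 50.0
```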
