At the start of this work, the only publicly available sequence resource for S. officinalis was a transcriptome from the 1,000 Plants (1KP) project22. This resource is a single dataset derived from pooled plant organs and, thus, was not optimal for the discovery of saponarioside biosynthetic genes. We, therefore, elected to generate our own transcriptome data for S. officinalis. We first determined the content of SpA and SpB in different S. officinalis organs. Because commercial standards of these two saponins are not available, we purified SpA and SpB from dried S. officinalis leaf material and confirmed the structures of the isolated molecules by extensive one-dimensional (1D) and two-dimensional (2D) nuclear magnetic resonance (NMR) (Supplementary Figs. 3–21 and Supplementary Tables 1 and 2). We then carried out targeted high-resolution liquid chromatography–mass spectrometry (HR LC–MS) analysis of extracts from six different S. officinalis organs (flowers, flower buds, young leaves, old leaves, stem and root; Supplementary Fig. 22). SpA and SpB were identified by comparing the retention times (RTs) and tandem MS (MS/MS) fragmentation patterns with purified standards. Because of the limited availability of purified saponarioside standards, amounts of SpA and SpB in soapwort plants were quantified relative to an internal standard (digitoxin) (Extended Data Figs. 1 and 2). The accumulation patterns of the two saponariosides differed, with SpA being most abundant in the flowers and flower buds and SpB being most abundant in the young and old leaves. The combined levels of both saponins were low in the stems and leaves and highest in the flowers and flower buds (Fig. 1b).
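The relative quantification described above can be illustrated with a minimal sketch. It assumes that the integrated peak area of each saponin is divided by the digitoxin peak area from the same run and then averaged across replicates; the function names and the toy peak areas are hypothetical and are not values from this study.

```python
from statistics import mean


def relative_level(analyte_area, internal_standard_area):
    """Peak area of the analyte expressed relative to the internal standard."""
    if internal_standard_area <= 0:
        raise ValueError("internal standard peak area must be positive")
    return analyte_area / internal_standard_area


def mean_relative_level(replicates):
    """Average the relative level over (analyte_area, standard_area) pairs."""
    return mean(relative_level(a, s) for a, s in replicates)


# Hypothetical peak areas (analyte, digitoxin) for three replicates of one organ.
toy_flower_spa = [(2.0e6, 1.0e6), (3.0e6, 1.0e6), (4.0e6, 2.0e6)]
toy_mean = mean_relative_level(toy_flower_spa)
```

Because the result is a ratio to the internal standard, it supports comparison between organs but does not by itself give an absolute concentration.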
We next performed Illumina paired-end RNA sequencing (RNA-Seq) on RNA from the six different organs (four biological replicates per organ). We also generated a pseudochromosome-level genome assembly of S. officinalis using PacBio single-molecule real-time circular consensus sequencing (CCS) and high-throughput chromosome conformation capture (Hi-C) sequencing technologies. PacBio long reads were assembled using HiFiasm23 and Hi-C data, resulting in 129 scaffolds with an N50 of 148.8 Mb. The largest 14 scaffolds contained 99.46% of the assembled sequences, forming 14 pseudochromosomes (Supplementary Tables 3 and 4). Both the genome size and the predicted chromosome number of S. officinalis reported here (2.0895 Gb; 1n = 14) correspond to values reported using flow cytometry24,25. The genome assembly was annotated using the RNA-Seq read alignments generated above and we additionally performed PacBio Iso-Seq CCS to aid in the annotation. Gene models were predicted using homology-based predictors and subjected to Pfam analysis to identify protein families, yielding 37,604 high-confidence protein-coding genes. Genome completeness was assessed using the Benchmarking Universal Single-Copy Orthologs (BUSCO) tool, which determines the presence or absence of highly conserved single-copy genes26. The BUSCO analysis revealed that the genome contained 95.2% of expected orthologs as complete single-copy genes, confirming our genome assembly and annotation to be of high quality. Syntenic analysis of the assembled genome was carried out versus other Caryophyllales species and the results showed clear macrosynteny with other species in Caryophyllaceae, as well as in Amaranthaceae (Extended Data Fig. 3).
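For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed from a list of scaffold lengths. The function name and the toy lengths are hypothetical; this is not the tooling used to produce the reported assembly statistics.

```python
def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in Mb), not the S. officinalis scaffolds.
toy_scaffolds = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_scaffolds)
```

In the toy case, the two largest scaffolds (50 and 40) are the first to cover at least half of the 150 total, so the N50 is the length of the second scaffold.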
Discovery of the biosynthetic genes for QA
The first step in triterpene biosynthesis involves the cyclization of the linear precursor 2,3-oxidosqualene to a range of diverse scaffolds by a family of enzymes known as oxidosqualene cyclases (OSCs)27. The aglycone core of SpA and SpB is QA, which is derived from one of the most common plant triterpenoid scaffolds, β-amyrin. We, therefore, initiated our search for saponarioside biosynthetic pathway genes by mining the translated S. officinalis genome for candidate OSCs. This revealed a total of four candidate OSC genes, including one predicted cycloartenol synthase (Saoffv11008135m), one predicted lupeol synthase (Saoffv11043295m) and two potential β-amyrin synthases (Saoffv11003490m and Saoffv11027757m) according to phylogenetic analysis (Fig. 2a). Saoffv11003490m showed overall low expression in all soapwort tissues compared to Saoffv11027757m and the relatively high phylogenetic branch length suggested that this may be a pseudogene or a diverged sequence from one carrying out β-amyrin synthesis (which is found in most higher plants); hence, it was not considered a likely candidate (Supplementary Table 5 and Fig. 2a). Functional analysis of Saoffv11027757m by Agrobacterium-mediated transient expression in the leaves of Nicotiana benthamiana revealed a product with the same gas chromatography (GC)–MS RT and mass spectrum as an authentic β-amyrin standard (1), confirming that this enzyme (hereafter named SobAS1) is indeed a β-amyrin synthase (Fig. 2b,c).
Fig. 2: Characterization of SobAS1.
a, Phylogenetic analysis of candidate S. officinalis OSCs. The maximum-likelihood tree was generated using an amino acid alignment of putative OSCs in S. officinalis and previously characterized OSCs from other plant species (listed in Supplementary Table 6). Bootstrap values less than 80% are shown beside each node. The scale bar indicates the number of amino acid substitutions per site. Common enzyme products produced by each clade are labeled on the right. SobAS1, characterized in this work as a β-amyrin (1) synthase, is highlighted in purple. The three other S. officinalis OSCs identified in this study are shown in bold. b, Transient expression of SobAS1 in N. benthamiana leaves. GC–MS total ion chromatograms (TICs) of leaf extracts coexpressing AstHMGR and SobAS1, along with a control (leaf expressing only AstHMGR) and a commercial standard of β-amyrin (1), are shown. Mass spectra for leaf extracts expressing SobAS1 and commercial β-amyrin standard are also given. c, Activity of SobAS1 in converting 2,3-oxidosqualene to β-amyrin (1).
We next performed coexpression analysis across different soapwort organs using SobAS1 as bait to identify candidate downstream pathway genes. The strength of coexpression was ranked using Pearson’s correlation coefficient (PCC)28. Although SobAS1 showed high expression in all soapwort organs, the highest absolute expression was in the flower, in accordance with our metabolite analysis (Supplementary Table 5 and Fig. 1b). Therefore, we only considered full-length candidates showing high coexpression with SobAS1 with highest expression in the flower. The resulting list was further filtered by prioritizing candidates annotated with InterPro domains for families of enzymes known to be involved in triterpene biosynthesis, including cytochrome P450s (CYPs; IPR001128), uridine diphosphate (UDP)-dependent glycosyltransferases (UGTs; IPR002213) and acyltransferases (ATs; IPR003480 and IPR001563)27 to give the shortlisted candidates shown in Extended Data Fig. 4.
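The bait-based ranking described above can be sketched as follows. This is an illustration only: it assumes expression values are held as one per-sample list per gene, the gene names and numbers are hypothetical, and the subsequent filters on full-length status, flower expression and InterPro domain are not modeled.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles.
    Returns None when either profile is constant (correlation undefined)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def rank_by_coexpression(expression, bait):
    """Rank all genes except the bait by descending PCC with the bait."""
    bait_profile = expression[bait]
    scored = []
    for gene, profile in expression.items():
        if gene == bait:
            continue
        pcc = pearson(bait_profile, profile)
        if pcc is not None:  # skip genes with no variation across samples
            scored.append((gene, pcc))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical expression profiles across four samples (not data from this study).
toy_expression = {
    "bait": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [1.0, 3.0, 2.0, 4.0],
}
ranked = rank_by_coexpression(toy_expression, "bait")
```

A gene that scales with the bait (geneA) ranks first, whereas an anticorrelated gene (geneB) ranks last; in practice, a PCC threshold or a top-N cutoff would then be applied before filtering by annotation.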
The saponarioside scaffold QA (4) is a β-amyrin-derived triterpene oxidized at positions C-28, C-16α and C-23 (Fig. 3a). As triterpene scaffolds are commonly oxidized by members of the CYP family29, we investigated the functions of the seven candidate CYPs in our shortlist (Extended Data Fig. 4). Each of these CYP candidates was coexpressed with SobAS1 in N. benthamiana by transient plant expression and leaf extracts were analyzed by GC–MS and LC–MS to monitor for new product peaks. Our screening implicated three candidate CYPs (encoded by Saoffv11003497m, Saoffv11043486m and Saoffv11042705m) in QA biosynthesis. These were renamed CYP716A378, CYP716A379 and CYP72A984, respectively. Transient expression of CYP716A378 together with SobAS1 resulted in near-complete conversion of β-amyrin (1) to oleanolic acid (2) (identified on the basis of a comparison with a commercial standard) (Fig. 3b). When a second candidate, CYP716A379, was transiently expressed together with SobAS1, we observed the formation of a new peak that we identified as echinocystic acid (3) on the basis of a comparison with commercial standards (Fig. 3b and Supplementary Figs. 23 and 24). Coexpression of CYP72A984 with SobAS1 and CYP716A379 resulted in the formation of a new product with an RT, mass and MS/MS fragmentation pattern that matched those of QA (4) standard (Fig. 3c). We also observed the production of another peak (4′) with a different RT to QA (Fig. 3c, Supplementary Fig. 25). This may be the product of CYP72A984 performing two consecutive C-23 oxidations on residual oleanolic acid resulting in gypsogenic acid, which has the same [M − H]− as QA (Supplementary Figs. 23 and 25). Interestingly, the activity of CYP72A984 also led to accumulation of a product with m/z 501.3219 ([M − H]− of hydroxylated QA) (Supplementary Fig. 26). This compound may be 16α-hydroxygypsogenic acid (GAOH), which is also present in soapwort plants as a saponin backbone4,5,6. 
Hence, CYP72A984 may also be able to perform further C-23 oxidation on QA to form GAOH (Supplementary Fig. 23). In summary, CYP716A378 introduces a carboxylic acid group at the C-28 position of β-amyrin (1); CYP716A379 is a dual-function enzyme that carries out the same modification and, in addition, has C-16α oxidation activity; and CYP72A984 performs C-23 oxidation to yield QA (4) (Fig. 3a). The phylogenetic relationships of these CYPs with other previously characterized plant CYPs are shown in Supplementary Fig. 27.
Fig. 3: Biosynthesis of QA.
a, Four S. officinalis enzymes enable biosynthesis of QA (4) in N. benthamiana. b, Products generated by transient expression of CYP716A378 (C-28 oxidase) and CYP716A379 (C-28,16α oxidase) in N. benthamiana. GC–MS TICs of leaf extracts coexpressing SobAS1 with either CYP716A378 or CYP716A379 are shown, along with a control (leaf expressing only AstHMGR) and the following commercial standards: bA (1, β-amyrin), OA (2, oleanolic acid) and EA (3, echinocystic acid). Mass spectra of bA (1), OA (2) and EA (3) for leaf extracts expressing SobAS1 with either CYP716A378 or CYP716A379 and for relevant commercial standards are also shown. c, Transient expression of CYP72A984 (C-23 oxidase) in N. benthamiana. LC–MS extracted ion chromatograms (EICs) of leaf extracts coexpressing CYP72A984 with the minimal gene set for 3 (SobAS1 and CYP716A379) are shown, along with a control (leaf expressing only AstHMGR) and a QA (4) commercial standard. EICs displayed are at m/z 485.3267 (calculated [M − H]− of QA (4)). MS and MS/MS spectra of QA (4) from the commercial standard and leaf extracts coexpressing SobAS1, CYP716A379 and CYP72A984 are also shown. Formation of another peak (4′) putatively identified as gypsogenic acid is also observed when CYP72A984 is coexpressed with SobAS1 and CYP716A379 (MS/MS shown in Supplementary Fig. 25).
Biosynthesis of the C-3 sugar chain
Having elucidated the steps required for the biosynthesis of QA (4), we next focused on the identification of candidate genes for the downstream pathway steps. SpA and SpB both have oligosaccharide chains attached at the C-3 and C-28 positions (Fig. 1a). The presence of a C-3 sugar chain is a common feature of triterpenoid saponins30. Additionally, the majority of saponins with a single sugar chain (monodesmosidic saponins) are decorated at the C-3 position of the aglycone rather than the C-28 position31. We, therefore, anticipated that the addition of the C-3 sugar chain was likely to occur first, followed by addition of the C-28 sugar chain.
The C-3 trisaccharide chain of SpA and SpB consists of d-glucuronic acid, d-galactose and d-xylose (Fig. 1a). The sugar that is directly attached to the C-3 position of QA is d-glucuronic acid. UDP-dependent sugar transferases belonging to glycosyltransferase family 1 (GT1) are typically responsible for the glycosylation of plant natural products32. However, several cellulose synthase-like (CSL) enzymes have also recently been reported to be involved in the 3-O-glucuronidation of triterpene aglycones33,34,35. We observed a predicted CSL hit (Saoffv11064433m) that showed high coexpression with SobAS1 (Extended Data Fig. 4). Phylogenetic analysis of this candidate revealed that Saoffv11064433m is a member of the CsyGT/CSLM family, which appears to be a well-conserved subgroup containing 3-O-glucuronic acid transferases (Supplementary Fig. 28). This was, therefore, prioritized for functional analysis. This gene was transiently expressed in N. benthamiana leaves along with the minimal gene set required to produce QA (4) (SobAS1, CYP716A379 and CYP72A984). LC–MS analysis of leaf extracts revealed a new peak (5) with a mass and MS/MS fragmentation pattern corresponding to the authentic 3-O--QA standard (5, hereafter abbreviated as QA-Mono) (Supplementary Fig. 29). On the basis of these results, we named this enzyme SoCSL1 (Fig. 4a). We also observed the accumulation of a minor product with m/z 677.3537 (Supplementary Figs. 23 and 30a). MS/MS analysis of this peak resulted in a loss of 176 (glucuronic acid moiety) from the parent ion with m/z 501.3231 (calculated [M − H]− of GAOH) (Supplementary Fig. 30b). Therefore, in addition to QA (4), SoCSL1 may act on GAOH putatively produced by the C-23 oxidation activity of CYP72A984 on 4. However, compared to the m/z 677.3537 product peak, m/z 661.3588 (QA-Mono, 5) is the major product formed when SoCSL1 is coexpressed with the QA (4) biosynthetic genes (Supplementary Fig. 23). 
This suggests that SoCSL1 may efficiently convert 4 to 5, thus pushing the equilibrium toward the production of saponins containing 4 as an aglycone, rather than GAOH.
Fig. 4: Complete biosynthetic pathway to SpB (13).
a, Integrated peak areas of EICs for each intermediate accumulating after sequential coexpression of pathway genes in N. benthamiana, starting with QA (4). Each bar represents the mean of six biological replicates and error bars indicate the s.e.m. QA (4) biosynthetic genes include SobAS1, CYP716A379 and CYP72A984. Data for full characterization of each enzyme are available in the Supplementary Information. b, Schematic showing the complete elucidated pathway from 2,3-oxidosqualene to SpB (13). The arrows represent the accumulation of metabolite products after each addition of associated enzyme rather than specifying a biosynthetic order in planta. Superscript circles (●) indicate structures that are supported by NMR analysis of the purified compound (reported here or in a previous study35) or by comparison with an authentic standard. MW, molecular weight.
We next screened the ten candidate UGTs in our shortlist of genes that were coexpressed with SobAS1 (Extended Data Fig. 4) for the ability to elongate the C-3 sugar chain. Each candidate was coexpressed one by one with the gene set needed for biosynthesis of QA-Mono (5) (SobAS1, CYP716A379, CYP72A984 and SoCSL1) and leaf extracts were analyzed by LC–MS. Coexpression of UGT73DL1 with the QA-Mono biosynthetic genes revealed a new peak (6) with a mass ([M − H]− = m/z 823.4116) consistent with the addition of a hexose to QA-Mono. The RT, mass and fragmentation pattern of this product matched those of an authentic standard of 3-O--QA (6, hereafter abbreviated as QA-Di) (Fig. 4, Supplementary Fig. 31). The subsequent coexpression of UGT73CC6 with UGT73DL1 and QA-Mono (5) biosynthetic genes led to another new product peak (7) with a mass ([M − H]− = m/z 955.4539) corresponding to 6 plus a pentose and an MS/MS fragmentation pattern that matched with a 3-O--QA authentic standard (7, hereafter abbreviated as QA-Tri) (Fig. 4 and Supplementary Fig. 32). Thus, UGT73DL1 and UGT73CC6 are able to extend the C-3 sugar chain through the addition of a d-galactose and a d-xylose, respectively. These two phylogenetically related UGTs are both located within group D of the GT1 superfamily (Supplementary Fig. 33).
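The mass shifts used to assign each glycosylation step can be checked with simple arithmetic. The sketch below assumes the standard monoisotopic masses of the sugar residues (sugar minus water) and starts from the calculated [M − H]− of QA (4) given in the text; the function and variable names are illustrative only.

```python
from decimal import Decimal

# Monoisotopic mass (Da) added per glycosidic bond formed (sugar minus water).
RESIDUE_MASS = {
    "glucuronic acid": Decimal("176.0321"),  # C6H8O6
    "hexose": Decimal("162.0528"),           # C6H10O5, e.g. galactose
    "pentose": Decimal("132.0423"),          # C5H8O4, e.g. xylose
}

# Calculated [M - H]- of QA (4), as stated in the text.
QA_M_MINUS_H = Decimal("485.3267")


def extend_chain(start_mz, residues):
    """Return the expected [M - H]- after each successive sugar addition."""
    mz = start_mz
    steps = []
    for residue in residues:
        mz += RESIDUE_MASS[residue]
        steps.append((residue, mz))
    return steps


# C-3 chain: glucuronic acid, then a hexose, then a pentose.
c3_chain = extend_chain(QA_M_MINUS_H, ["glucuronic acid", "hexose", "pentose"])
for residue, mz in c3_chain:
    print(f"+ {residue}: m/z {mz}")
```

The three values obtained correspond to the m/z values reported for QA-Mono (5), QA-Di (6) and QA-Tri (7). Decimal arithmetic is used here only to avoid floating-point rounding in the fourth decimal place.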
Biosynthesis of the C-28 sugar chain
We next focused our efforts on elucidation of the steps required for the addition of the main linear C-28 sugar chain of SpB, which is composed of d-fucose linked to a trisaccharide chain consisting of l-rhamnose and two d-xyloses (Fig. 1a). We revisited the remaining eight UGT candidates in our shortlist (Extended Data Fig. 4) and coexpressed each of these in N. benthamiana leaves with the gene set required for the biosynthesis of QA-Tri (7). The first sugar at the C-28 position is d-fucose. Transient coexpression of UGT74CD1 with the saponarioside biosynthetic genes identified so far resulted in the formation of a product (8) with the same RT, mass and MS/MS fragmentation pattern as the authentic standard of 3-O--28-O--QA (8, hereafter abbreviated as QA-TriF) and was identified as such (Fig. 4 and Supplementary Fig. 34). However, QA-TriF (8) accumulated at very low levels and was expected to impede the elucidation of further downstream genes. Poor accumulation of d-fucosylated saponins in N. benthamiana was also previously observed and suggested to indicate that UDP-α-d-fucose might be limiting33,35. We recently showed that this sugar nucleotide is not likely to be relevant for production of the d-fucose moiety found in the structurally related triterpene glycosides from the Chilean soapbark tree35. Instead, UDP-4-keto-6-deoxy-glucose (an intermediate in UDP-l-rhamnose biosynthesis) acts as the sugar donor for transfer of 4-keto-6-deoxy-glucose to the backbone before being reduced in situ to d-fucose by the short-chain dehydrogenase–reductase (SDR) QsFucSyn, which functions as a 4-ketoreductase35. During our coexpression analysis, we found an SDR candidate (Saoffv11002756m) that showed strong coexpression with SobAS1 (PCC = 0.941) and a high level of absolute expression in the flower organ (Extended Data Fig. 4). The predicted SDR shared 57.2% amino acid sequence identity with QsFucSyn. 
The transient coexpression of this SDR (renamed SoSDR1) with UGT74CD1 and QA-Tri (7) biosynthetic genes led to a significant increase in the production of 8 (Supplementary Fig. 34). Our results suggest that fucosylation of QA-Tri (7) may follow the same mechanism as found in soapbark. UGT74CD1 may transfer 4-keto-6-deoxy-glucose to 7, which is subsequently reduced to d-fucose by the activity of SoSDR1, resulting in the production of QA-TriF (8). Next, the additional coexpression of UGT79T1 with the gene set required to produce 8 led to near-complete conversion of 8 to a new product (9) with the expected mass of 8 plus a deoxyhexose ([M − H]− = m/z 1,247.5697) (Fig. 4 and Supplementary Fig. 35). MS/MS analysis of this new product revealed a major fragment ion with mass corresponding to QA-Tri (7). This suggested that the addition of deoxyhexose is on the d-fucose moiety of 8, forming a disaccharide chain that fragments off together ([M − 146 − 146 − H]− = m/z 955.4539) (Supplementary Fig. 35b). On the basis of our results, we putatively identified this new product as 3-O--28-O--QA (9, hereafter abbreviated as QA-TriFR).
Additional rounds of screening led to the discovery of two UGTs with activity toward 9 and the downstream product. The coexpression of UGT79L3 with the saponarioside biosynthetic genes identified so far resulted in a noticeable depletion of 9 and accumulation of a new product (10) with the anticipated mass of 9 plus a pentose ([M − H]− = m/z 1,379.6119), suggesting the addition of d-xylose and formation of 3-O--28-O--QA (10, hereafter abbreviated as QA-TriFRX) (Fig. 4 and Supplementary Fig. 36). The subsequent coexpression of UGT73M2 together with UGT79L3 and the set of genes predicted to be required for the biosynthesis of 9 led to the formation of a product (11) with a mass ([M − H]− = m/z 1,511.6542) consistent with the addition of a pentose to 10 (Fig. 4 and Supplementary Fig. 37). We anticipated this product to be 3-O--28-O--QA (11, hereafter abbreviated as QA-TriFRXX). MS/MS analyses of both 10 and 11 revealed a major fragment ion with mass corresponding to QA-Tri (7), suggesting that UGT79L3 and UGT73M2 are both involved in the elongation of the C-28 sugar chain rather than acting upon the aglycone itself (Supplementary Figs. 36b and 37b). On the basis of our results, we putatively identified UGT79L3 as a xylosyltransferase that acts on QA-TriFR (9) to produce QA-TriFRX (10) and UGT73M2 as another xylosyltransferase that adds the terminal d-xylose to the main C-28 sugar chain.
The discovery of UGT74CD1, SoSDR1, UGT79T1, UGT79L3 and UGT73M2 completes the set of genes required to produce the main linear part of the C-28 sugar chain present in SpA and SpB. Phylogenetic analysis of these UGTs revealed UGT74CD1 to be a member of GT1 group L, which contains ester-forming GTs, and UGT79T1 and UGT79L3 to be members of GT1 group A, a group known to contain GTs that elongate glycosidic branches32 (Supplementary Fig. 33). Together with UGT73DL1 and UGT73CC6, which are involved in the building of the C-3 sugar chain, UGT73M2 grouped within the GT1 group D subfamily UGT73 (Supplementary Fig. 33).
Addition of d-quinovose by a noncanonical TG
Thus far, we have identified the genes and enzymes that are anticipated to produce QA-TriFRXX (11). The missing steps needed to complete the biosynthetic pathway to SpB are those required for the addition of 4-O-acetylquinovose to 11. Although d-quinovose is a common feature of specialized metabolites produced by marine animals such as starfish and sea cucumbers36, it is considered unusual as a component of plant metabolites37. Consequently, little to nothing is known about the mechanisms of addition of d-quinovose to plant natural product scaffolds38. Although GTs associated with plant natural product biosynthesis typically belong to family 1 of the GT superfamily, none of the UGTs in our candidate shortlist showed quinovosyltransferase activity toward 11. We noted, however, that a gene predicted to encode a member of a different class of carbohydrate-active enzymes, GH1 transglycosidase (TG), was highly coexpressed (PCC = 0.971) with SobAS1 (Extended Data Fig. 4). When we expressed this gene (Saoffv11054913m) with the other identified saponarioside pathway genes in N. benthamiana, two new products (12 and 12′) with different RTs but the same mass ([M − H]− = m/z 1,657.7121), corresponding to the expected mass of 11 plus deoxyhexose, were observed (Supplementary Fig. 38). These two products both had the same fragmentation pattern when analyzed by MS/MS. The main fragment ions were m/z 1,525.6699 and m/z 955.4539 ([M − H]− of 7), which suggested a loss of pentose, followed by the loss of the remaining C-28 sugar chain, resulting in 7 (Supplementary Fig. 38b). As the anticipated product, 3-O--28-O--QA (hereafter abbreviated as QA-TriF(Q)RXX), is not commercially available, we generated an authentic QA-TriF(Q)RXX standard by purifying the target saponin from extracts of S. officinalis flowers, followed by extensive 1D and 2D NMR analysis for structural confirmation (Supplementary Figs. 39–49 and Supplementary Table 7). 
When we compared 12 and 12′ with the authentic QA-TriF(Q)RXX standard, we observed that, although the MS/MS fragmentation of both products matched the QA-TriF(Q)RXX standard, only 12 had the same RT (Supplementary Figs. 38 and 50).
We then carried out large-scale transient expression using 110 N. benthamiana plants and attempted to purify 12. Because of its low accumulation, only a crude sample of 12 was obtained even after extensive purification steps. However, 1D and 2D NMR analysis of this crude sample supported the identity of 12 as QA-TriF(Q)RXX (Supplementary Figs. 51–61 and Supplementary Table 8). Taken together, our data suggest that this GH1 TG (which we call SoGH1) is involved in the addition of d-quinovose to the d-fucose moiety of QA-TriFRXX (11), resulting in the production of QA-TriF(Q)RXX (12). Additionally, the matching fragmentation pattern of 12 and 12′ may suggest that these are positional isomers of the terminal d-xylose in the C-28 sugar chain of 12 (Supplementary Fig. 38c). The order of enzyme activity in planta may occur in a complex network and UGT73M2 may transfer d-xylose to d-quinovose after the activity of SoGH1.
GH1 TGs are an emerging class of sugar transferases with roles in plant specialized metabolism. These enzymes use acyl sugars rather than nucleotide sugars as the sugar donors39. The limited number of GH1 TGs characterized so far all transfer glucose40,41,42,43,44,45,46, with the exception of one galactosyltransferase47. Our phylogenetic analysis clustered SoGH1 with the At/Os6 subfamily as designated by Opassiri et al.48, which contains most of the previously characterized GH1 TG natural product sugar transferases (Fig. 5a). GH1 enzymes typically have N-terminal signal peptides48,49 and all reported GH1 TGs in the At/Os6 subfamily contain signal peptides predicted to target the vacuole40,41,42,43,44,45,46,47. Intriguingly, signal sequence analysis by SignalP 5.0 (ref. 50) (Fig. 5a) and amino acid alignment of SoGH1 with other characterized members of At/Os6 (Fig. 5b) indicated that SoGH1 lacks an N-terminal leader sequence (Fig. 5b). We next investigated SoGH1 localization by generating C-terminal mRFP (monomeric red fluorescent protein)-tagged SoGH1 recombinant protein (SoGH1:mRFP). Confocal microscopy of N. benthamiana leaves coinfiltrated with expression constructs for SoGH1:mRFP and free GFP (green flu
