Identification of the protein coding capability of coronavirus defective viral genomes by mass spectrometry

In the current study, nanopore direct RNA sequencing and liquid chromatography-tandem mass spectrometry (LC–MS/MS) analysis were employed to examine whether DVGs can encode proteins in infected cells. With the protein databases generated by nanopore direct RNA sequencing, six DVG-encoded proteins were identified by LC–MS/MS based on the featured fusion peptides caused by recombination during DVG synthesis. The limitations and the biological significance of the study are discussed.

Below, we explain why 34,104 (from total cell lysates) and 34,056 (from cell lysates derived from the RNA–protein pull-down assay) protein species were identified by LC–MS/MS analysis. First, coronavirus DVGs are recombination products and thus contain ORFs of various lengths derived from one or more portions of the ORFs in the full-length genome. As a result, many DVG species (145,015) were identified by nanopore direct RNA sequencing, and thus many potential DVG-encoded protein sequences (189,221) were available as protein reference databases for LC–MS/MS. Second, the diverse genome structures of DVGs may encode in-frame peptides that have the same amino acid sequences as those encoded by the full-length genome. Consequently, if the peptides determined by LC–MS/MS analysis match the amino acid sequences of the DVG-encoded proteins and the protein scores are higher than 41, the DVG-encoded protein species are identified based on the provided protein reference databases. This is why many DVG-encoded protein species (34,104 from total cell lysates and 34,056 from cell lysates obtained by the RNA–protein pull-down assay) were identified by LC–MS/MS analysis. However, this may lead to false-positive results because the peptides that match the amino acid sequences of DVG-encoded proteins may also be encoded by the full-length coronavirus genome, as described above, and thus cannot be used as markers to determine whether the identified proteins are encoded by coronavirus DVGs. For this reason, we propose that peptides containing discontinuous in-frame amino acid sequences derived from different portions of full-length genome-encoded proteins, or containing out-of-frame amino acid sequences, are fusion peptides encoded by DVGs, because DVGs are synthesized by recombination of the viral genome. Therefore, these fusion peptides can be used as markers to identify the proteins actually encoded by coronavirus DVGs.
Consequently, six DVG-encoded proteins were identified through the identification of six fusion peptides, as shown in Figs. 1, 2 and 3.
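The marker criterion described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. It assumes that a peptide counts as a fusion marker when it occurs in a DVG-encoded protein sequence but is not a contiguous substring of any protein encoded by the full-length genome, which covers both discontinuous in-frame junctions and out-of-frame stretches at the amino acid level. The names `PeptideHit`, `is_fusion_marker` and `classify_hits`, and all sequences, are hypothetical toy values; only the score cutoff of 41 comes from the text.

```python
from dataclasses import dataclass

SCORE_CUTOFF = 41  # from the text: protein scores must be higher than 41


@dataclass
class PeptideHit:
    """One LC-MS/MS peptide match against the DVG protein database (hypothetical record)."""
    peptide: str
    dvg_protein_id: str
    protein_score: float


def is_fusion_marker(peptide, full_length_proteins):
    """True if the peptide is absent from every full-length genome-encoded protein."""
    return not any(peptide in protein for protein in full_length_proteins)


def classify_hits(hits, dvg_db, full_length_proteins):
    """Split accepted hits into DVG-specific markers and ambiguous matches."""
    markers, ambiguous = [], []
    for hit in hits:
        dvg_seq = dvg_db.get(hit.dvg_protein_id, "")
        # Discard hits at or below the score cutoff or not present in the DVG protein.
        if hit.protein_score <= SCORE_CUTOFF or hit.peptide not in dvg_seq:
            continue
        if is_fusion_marker(hit.peptide, full_length_proteins):
            markers.append(hit)      # can only have come from a DVG
        else:
            ambiguous.append(hit)    # could also come from the full-length genome
    return markers, ambiguous


# Toy sequences, invented for illustration only.
full_length_proteins = ["MSFTPGKQSSSRASSGNRSG", "MADNGTITVEELKQLLEQW"]
# A DVG protein joining the start of the first protein to the end of the second.
dvg_db = {"dvg_1": "MSFTPGKQ" + "EELKQLLEQW"}

hits = [
    PeptideHit("GKQEELK", "dvg_1", 55.0),  # spans the recombination junction
    PeptideHit("MSFTPGK", "dvg_1", 60.0),  # also present in the full-length protein
    PeptideHit("EELKQLL", "dvg_1", 30.0),  # below the score cutoff
]

markers, ambiguous = classify_hits(hits, dvg_db, full_length_proteins)
print([h.peptide for h in markers])
print([h.peptide for h in ambiguous])
```

In this toy case only the junction-spanning peptide is retained as a marker, whereas the peptide shared with the full-length protein is set aside as ambiguous. A real search would additionally have to handle details such as isobaric residues and missed cleavages, which this sketch ignores.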

In addition, because the read number for each of the six DVGs is low (only 1), whether the abundance of DVGs identified by nanopore direct RNA sequencing correlates with that of their encoded proteins identified by LC–MS/MS remains unknown. Our explanation for the results is as follows. Because coronavirus DVGs are recombination products and thus contain ORFs of various lengths derived from one or more portions of the ORFs in the full-length genome, the diverse genome structures of DVGs may encode in-frame peptides that have the same amino acid sequences as those encoded by the full-length genome. Consequently, if the peptides determined by LC–MS/MS analysis match the amino acid sequences of DVG-encoded proteins and the protein scores are higher than 41, the DVG-encoded protein species are identified based on the provided protein reference databases. However, the peptides that match the amino acid sequences of DVG-encoded proteins may also be encoded by the full-length coronavirus genome, and thus we cannot determine whether the identified peptides, and hence the proteins, are encoded by coronavirus DVGs or by the full-length genome. Consequently, DVG species with higher read numbers may encode more proteins, but without the featured fusion peptides as markers, whether such a correlation exists still cannot be determined. That is also the reason why we propose that, as described above, peptides containing discontinuous in-frame amino acid sequences derived from different portions of full-length genome-encoded proteins, or containing out-of-frame amino acid sequences, are fusion peptides encoded by DVGs. Thus, at the current stage, we can only conclude that DVGs can encode proteins; whether the abundance of DVGs correlates with that of their encoded proteins remains unknown.
However, since the six identified DVGs with a read number of 1 are capable of encoding proteins, as determined in the current study, we speculate that other DVG species with higher read numbers may also be capable of encoding proteins, although they do not encode featured fusion peptides that can serve as markers to determine their protein-coding capability.
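One way to see why a read number of 1 for every marker-confirmed DVG leaves the correlation undetermined is that a correlation coefficient is undefined when one variable does not vary. The Python sketch below illustrates this with a hand-written Pearson coefficient. It is not an analysis performed in this study: the function name `pearson` is hypothetical and the protein abundance values are invented, whereas the read numbers of 1 for the six DVGs are taken from the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient, or None if either variable is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # zero variance: the coefficient is undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# From the text: each of the six marker-confirmed DVGs has a read number of 1.
read_numbers = [1, 1, 1, 1, 1, 1]
# Hypothetical protein abundance values, invented for illustration only.
protein_abundance = [3.0, 5.0, 2.0, 8.0, 1.0, 4.0]

print(pearson(read_numbers, protein_abundance))
```

Because all six read numbers are identical, the function returns `None` regardless of the protein values, so no relationship between DVG abundance and protein abundance can be estimated from these six DVGs alone.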

It is known that (i) coronavirus DVGs can be packaged [31], (ii) coronavirus N protein can inhibit host innate immunity [32] and (iii) innate immunity is the first line of host defense against virus infection [33]. In addition, the protein databases derived from nanopore direct RNA sequencing in the current study suggest that some DVG-encoded fusion proteins contain part or all of the N protein. It is therefore speculated that one of the functions of coronavirus DVG-encoded fusion proteins is to regulate innate immunity, affecting virus replication and subsequent pathogenicity. On the other hand, coronavirus N protein has also been suggested to be important for replication and transcription (synthesis of coronavirus sgmRNAs, including sgmRNA N) [34, 35]. However, N protein can only be synthesized from sgmRNA N, and consequently, the question is how the coronavirus genome replicates and transcribes sgmRNAs before N protein is synthesized. As described above, because (i) coronavirus DVGs can be packaged [31], (ii) some DVGs contain a partial or complete N protein ORF and (iii) DVGs can be translated, as evidenced by the results of the current study, it is also argued that, after entry into the cells, the released DVGs with a partial or complete N protein ORF can be immediately translated into N-containing fusion proteins, which in turn can facilitate replication and transcription of the full-length coronavirus genome before N protein is synthesized from sgmRNA N. According to the argument above, the DVG-encoded fusion proteins in coronaviruses, including SARS-CoV-2, may have an impact on pathogenesis by affecting innate immunity and replication.
Lastly, it is also proposed that other coronavirus DVGs, which encode other species of fusion proteins or out-of-frame novel proteins (when compared with the original ORFs in the full-length genome), may have effects on pathogenesis different from those described above, although the functions of their encoded proteins remain to be determined. It is worth noting that, based on the previous study [26], the species and amounts of DVGs can be altered under different infection conditions, such as in different infected cells and under different selection pressures. Since DVGs can encode various proteins, such alterations in the amounts and species of DVGs, and thus of the encoded proteins, may be a way for coronaviruses to respond to environmental changes, also contributing to coronavirus pathogenesis.

The possible reasons why the featured fusion peptides were not detected in the total cell lysates by LC–MS/MS are as follows. First, because there are too many species of DVGs in cells, the amount of each DVG-encoded protein (especially a protein with a featured fusion peptide) in a fixed amount of cell lysate may not be sufficient to be detected by LC–MS/MS. Second, not every DVG-encoded protein contains a featured fusion peptide (based on the protein reference databases generated by nanopore direct RNA sequencing for BCoV), further limiting the number of protein species that can be identified. Third, because SuperScript™ III reverse transcriptase (cat. no. 18080044, Thermo Fisher Scientific, Waltham, USA), which is optimized to synthesize first-strand cDNA up to ~12 kb, was used for nanopore direct RNA sequencing, the identified coronaviral RNA species, including DVGs, may not cover all coronavirus transcripts, especially the longer ones. Thus, the protein reference databases may not contain the full information on the DVG-encoded proteins, limiting the number of protein species identified by LC–MS/MS analysis.

As shown in Figs. 1, 2 and 3, the results of the RNA–protein pull-down assay followed by LC–MS/MS suggest that DVGs have the capability to encode proteins. These results also imply that other DVGs may have the capability to encode proteins. Consequently, the DVG-encoded proteins may play important roles during coronavirus infection. Thus, the current results may open an attractive field of study regarding the biological functions of proteins encoded by DVGs. Determining the functions of DVG-encoded proteins is a priority for understanding their roles in coronavirus pathogenesis. The outcomes of these studies may contribute to the development of antiviral strategies.
