A large-scale genomically predicted protein mass database enables rapid and broad-spectrum identification of bacterial and archaeal isolates by mass spectrometry

Microorganisms

The bacterial and archaeal strains used in this study are listed in Additional file 2: Table S1; most strains were obtained from the Japan Culture of Microorganisms (JCM), Biological Resource Center, NITE (NBRC), or Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH (DSMZ). Some strains were originally isolated and maintained in our laboratory. In total, 94 strains (84 bacterial and 10 archaeal strains) were used, representing a wide range of phylogenetically distinct bacterial and archaeal lineages (15 phyla in total). Anaerobic strains were cultivated using a basal medium described previously [17]. Anaerobic cultivations were performed at either 37 or 55°C in 50-ml serum vials containing 20 ml of medium (pH25°C 7.0) under an atmosphere of N2/CO2 (80:20, v/v). Neutralized substrates (and/or hydrogen) were added to the vials prior to inoculation (Additional file 2: Table S1). Aerobic strains were grown with either 802 or 909 medium (Microlunatus Medium) described in the NBRC online catalog (https://www.nite.go.jp/nbrc/catalogue/). Growth temperature, atmosphere, basal medium, and substrate used are shown in Additional file 2: Table S1.

For some Actinomycetota, four different cultivation conditions were evaluated; (1) liquid-medium culture with a complex (carbohydrate-containing) medium (227 medium without agar or 802 medium without agar), (2) solid-medium culture with a complex medium (227 medium or 802 medium), (3) liquid-medium culture with a protein-based medium (230 medium without agar or DMS92 medium without agar), and (4) solid-medium culture with a protein-based medium (230 medium or DSM92 medium) (Additional file 5: Table S4). For MALDI-TOF MS analysis, grown cells were harvested at the stationary growth phase and stored in 100% ethanol at − 20°C. Strains lacking public genome sequences were subjected to genome sequencing as described below and cell pellets of these strains were also stored in RNAlater Stabilization Solution (Invitrogen) at − 20°C.

Mouse fecal samples

Feces of healthy wild-type C57BL/6 mice were used as the inoculum for the cultivation of anaerobes. Fecal pellets (2–3 pellets) from two males were collected and cut into two pieces; one piece was transferred to a 1.5-ml tube with RNAlater Stabilization Solution for DNA extraction and the other piece was homogenized in another 1.5-ml tube containing the anaerobic basal medium described above, followed immediately by placing the homogenates in the medium into an anaerobic 50-ml serum vial with the same basal medium for cultivation. Feces in RNA later were stored at − 20°C until DNA extraction. All the fecal samples were mixed and used for further analyses.

Cultivation of microbes from fecal samples

The fecal sample in an anaerobic 50-ml serum vial was immediately subjected to cultivation of bacterial cells using solid EG medium (JCM 14 medium) (http://jcm.brc.riken.jp). Incubation was performed under anaerobic conditions (N2/CO2/H2, 92:5:3, v/v) at 37°C for 3 days. More than 100 colonies on the plates were randomly selected and anaerobically subcultured with liquid YCFA medium (JCM 1130 medium) under the conditions described above. We note that no further purification step of colonies was performed since we sought to examine the performance of GPMsDB identification for rapid screening of colonies. The grown cells were then subjected to MALDI-TOF MS analysis and full-length 16S rRNA gene sequencing (described below).

DNA extraction and Illumina sequencing

DNA extraction and purification from pure cultures was performed using the MagAttract HMW DNA Kit (Qiagen). Short-read sequencing libraries were prepared using the Illumina DNA Prep Kit (formerly, Nextera DNA Flex, Illumina) starting from 10 ng of input DNA and sequenced on a NextSeq 500 instrument. DNA extraction from the mixed fecal sample was performed using the method described previously [18]. Two DNA extracts were prepared from the same sample to facilitate differential coverage population genome binning [19, 20]. Briefly, the sample was divided into two fractions; one was used for DNA extraction with ISOSPIN-based bead-beating (three rounds of beating for 1 min beating each time [18]), and the remainder was used for DNA extraction with the same ISOSPIN-based method without bead-beating (the three rounds of beating were replaced with weak vortexing). The DNA concentration of the extracts was measured with a Qubit dsDNA BR Assay Kit (Thermo Fisher Scientific). DNA library preparation was performed using the SMARTEer ThruPLEX Seq Kit (TaKaRa) with Covaris-based DNA fragmentation as described previously [18], targeting an average fragment size of 300 bp. DNA sequencing of the two libraries was performed with a single sequencing run using the NextSeq 500/550 Mid Output Kit v2.5 (300 cycles). Basic statistics of all the sequencing data are shown in Additional file 3: Table S2.

Nanopore sequencing for pure cultures

For Geobacter hydrogenophilus H-2, Pseudothermotoga lettingae TMO, and Thermodesulfovibrio islandicus TSL-P1, Oxford Nanopore Technologies (ONT) sequencing libraries were prepared using the SQK-PBK004 PCR Barcoding Kit. For Agromyces rhizosphaerae 14, Brachybacterium conglomeratum 5–2, Desulforhabdus amnigena ASRB1, Glycomyces algeriensis LLR-39Z-86 and Tepidanaerobacter syntrophicus OL, ONT sequencing libraries were prepared using the SQK-LSK109 Ligation Sequencing Kit and EXP-NBD104 Native Barcoding Expansion pack. Sequencing was performed on an R9.4.1 flow cell using an ONT MinION device. Basic statistics of the sequencing data are shown in Additional file 3: Table S2.

Genome and metagenome assembly and binning

For the Illumina sequencing data, binary base call sequence files were converted to FASTQ format using Illumina’s bcl2fastq Conversion Software (version 2.20.0.422). For the ONT sequencing data, basecalling was performed with Guppy (version 4.5.3; https://community.nanoporetech.com) using the high-accuracy model (command line flags –config dna_r9.4.1_450bps_hac.cfg), and library demultiplexing and trimming of barcodes (command line flag –trim_barcodes); short and low-quality reads (< 1000 bp and average quality of < 9) were discarded. For the Illumina sequencing data, reads were quality controlled with fastp (version 0.20.0 [21]), specifying command line flags –trim_tail1 1 –trim_tail2 1 –cut_right –cut_right_window_size 4 –cut_right_mean_quality 18 –trim_poly_x –poly_x_min_len 10 –n_base_limit 0 –low_complexity_filter –length_required 50. Long-read assembly was performed using Flye (version 2.9 [22]), with default parameters. The long-read assembly was then polished using the quality-controlled ONT reads with Racon (version 1.5.0 [23]), specifying command line flags -m 8 -x -6 -g -8 -w 500, followed by Medaka (version 1.2.3; https://github.com/nanoporetech/medaka), specifying command line flag -m r941_min_high_g360. A final polishing step was performed using the quality-controlled Illumina short reads with Polypolish (version 0.5.0 [24]), with default parameters. The strains lacked nanopore long-read data due to difficulty in cultivating them in a large-scale culture (Additional file 3: Table S2). Illumina read pairs were merged using Pear v0.9.11 [25] and quality control of unmerged pairs was performed with Trimmomatic v0.39 [26], with specifying options “LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36.” The resulting merged/unmerged reads were used for assembly using SPAdes v3.13.1 [27] with the options “careful” and “-k 77,99,127.” For metagenomes of mouse feces, in total, 60 million read pairs and 54 million read pairs for DNA extracts with and without bead-beating were obtained and quality control of these reads was performed using fastp v0.20.0 [21] as described previously. MetaSPAdes v3.15.4 [28] was used for coassembly and assembly for individual samples with options “–only-assembler” and “-k 77,97,127”. Assembly statistics are reported for contigs ≥ 1000 bp (Additional file 3: Table S2). Reads were mapped to contigs with Bowtie v2.4.1 [29] with default parameters and the coverage of contigs was calculated with Samtools v1.15.1 [30]. MAGs were independently recovered from assemblies for each sample using MetaBAT v2.15 [31] with the option “–minContig 1500”. In parallel, Maxbin v2.2.7 [32] was used with coassembly contigs using differential coverage information for the two samples (with an option “-min_contig_length 1500”). Similarly, genome binning with Concoct v1.1.0 [33] was performed with coassembly contigs using differential coverage information for the two samples (with the option “–length_threshold 1500”). The resulting genome bins were processed with DAS tool v1.1.3 [34] with default settings to integrate the bins into an optimized, nonredundant set of bins. The completeness and contamination of the genome bins in the nonredundant set were estimated using CheckM using lineage-specific marker genes and default parameters. CoverM v0.6.1 (https://github.com/wwood/CoverM) was used to estimate the abundance (%) of each genome bin in the sample (shotgun data with bead-beating DNA extraction) with the “genome” function with Minimap2 v2.17-r941 [35] and Samtools v1.13. Kraken2 v2.1.1 [36] was used to estimate the taxonomic composition of the sample (bead-beating DNA data) with default parameters except for the option “–confidence 0.05,” using the reference genome data GTDB-r202 (R06-RS202) [6]. Basic statistics of the assembly are shown in Additional file 3: Table S2.

Full-length 16S rRNA gene sequencing

Near full-length 16S rRNA gene amplicon libraries for long-read sequencing (Oxford Nanopore Technologies) were prepared by two-step tailed PCR directly from colonies of the isolates. To this end, a loopful of cells was collected and resuspended in 20 µl of nuclease-free water. In the first round of PCRs, the V1 to V9 regions of the 16S rRNA gene were amplified using the primers 5′-TTTCTGTTGGTGCTGATATTGCAGAGTTTGATCMTGGCTCAG-3′ (27F forward primer, the locus-specific region is underlined) and 5′-ACTTGCCTGTCGCTCTATCTTCTACGGYTACCTTGTTACGACTT-3′ (1492R forward primer). Reactions (20 µl) contained 1 × Platinum SuperFi II Green PCR Master Mix, 500 nM each of the forward and reverse primers, and 1 µl of cell suspension. The thermal cycling conditions were as follows: 98°C for 3 min; 30 cycles of 98°C for 10 s, 60°C for 10 s and 72°C for 45 s; 72°C for 5 min. Following verification of amplicon sizes by agarose gel electrophoresis, PCR products were purified using the Agencourt AMPure XP PCR Purification System (0.6 × volume of beads) and eluted in 40 µl of nuclease-free water. The second round of PCRs (25 µl) contained 1 × Platinum SuperFi II Green PCR Master Mix, 0.5 µl of barcoded PCR primers (“PCR Barcode” from ONT’s PCR Barcoding Expansion 1–96, EXP-PBC096), and 1 µl of purified first-round PCR product. Amplification was performed for 8 cycles using the temperature profile described above, PCR products were purified as described above and eluted in 35 µl of nuclease-free water. Amplicon concentrations were determined using the D5000 DNA ScreenTape Assay and Agilent 4200 TapeStation, and sequencing libraries combined in equimolar proportions. The pooled libraries were then further processed using ONT’s Ligation Sequencing Kit (SQK-LSK110) following the manufacturer’s instructions. Sequencing was performed on an R9.4.1 flow cell using a MinION device. Two 3-h sequencing runs were performed on a single flow cell and the flow cell washed using ONT’s Flow Cell Wash Kit (EXP-WSH004) between sequencing runs. Guppy (version 6.0.1; https://community.nanoporetech.com) was used for basecalling with the superaccuracy model (command line flags –config dna_r9.4.1_450bps_sup.cfg) and library demultiplexing (command line flags –require_barcodes_both_ends –trim_barcodes –barcode_kits "EXP-PBC096"). Sequences with both forward and reverse PCR primers, allowing up to 3 mismatches, anchored at the end of the reads were then identified and trimmed using Cutadapt (version 3.5 [37]); reads without identifiable primers were discarded. Reads with a length of 1300 to 1700 bp and an average quality of 12 were retained using SeqKit (version 2.2.0 [38]) and (re)oriented using VSEARCH’s (version 2.18.0; [39]) orient function using the Ribosomal Database Project Training Set 18 [40]. Next, sequences from each library were individually clustered using isONclust (version 0.0.6.1 [41]), with default parameters. For each cluster, 250 sequences with the highest average base quality and a length within one standard deviation of the mean were retained. These sequences were then compared against each other using VSEARCH’s allpairs_global function, specifying –id 0.9, and the sequence with the highest average identity to all other sequences was retained. This sequence was then polished with ONT’s Medaka (version 1.6.1; https://github.com/nanoporetech/medaka), specifying command line flag -m r941_min_sup_g507. Polished consensus sequences were compared against the 16S rRNA gene sequences associated with GTDB r95 (R05-RS95) using VSEARCH, specifying command line flags –id 0.9 –maxaccepts 100 –maxrejects 25 –query_cov 0.95, and default identity definition. Taxonomy was assigned based on the match with the highest sequence identity, including multiple matches.

Sample pretreatment for MALDI-TOF MS

For the preparation of MALDI-TOF MS measurement, the following four preparation methods were used: (1) simple formic acid-based method (FA), (2) bead-beating in 2,2,2-trifluoroacetic acid (TFA-beating), (3) heating in formic acid with beating in ethanol (FA heating), and (4) simple formic acid heating method (simple FA heating), which were developed based on the methods previously described [42,43,44]. For the first method (FA), briefly, 1–20 ml of grown culture was transferred into 1.5–15 ml tubes and centrifuged at 5000–10,000 × g for 5–10 min. Then, the supernatant was discarded, and 100–500 µL of 100% ethanol was placed on the cell pellet. After centrifugation at 10,000 × g for 5 min, ethanol was discarded, and cell pellets were dried at room temperature. Ten microliters of formic acid solution (70%) was then added to the cell pellet and mixed, followed by the addition of 10 µL of acetonitrile. Vortexing was then performed for 30 min at room temperature. After centrifugation at 10,000 × g for 5 min, 1 µL of the supernatant was spotted onto MALDI plates and air-dried. One microliter of alpha-cyano-4-hydroxycinnamic acid (CHCA) solution (10 mg/mL, in the presence of 50% acetonitrile and 1% trifluoroacetic acid) was used as the matrix solution. For performing TFA-beating, cell pellets were prepared, and 100% ethanol was added as described above. Then, 0.5 g of 0.1-mm autoclaved zirconia beads was added, followed by bead-beating using the FastPrep-24 instrument for 60 s at a speed of 6 m/s 3 times. After centrifugation at 10,000 × g for 5 min, the ethanol was discarded, and the cell pellets were dried at room temperature. Then, 75 µL of 50% acetonitrile in 1% trifluoroacetic acid solution was added, and the mixture was homogenized. Then, the solution was transferred to a 2.0 mL screwcap tube containing 50 mg of 0.1-mm autoclaved zirconia beads, followed by bead-beating using the FastPrep-24 instrument for 60 s at a speed of 6 m/s 1–4 times. After centrifugation at 10,000 × g for 5 min, 1 µL of the supernatant was spotted onto MALDI plates and air-dried. One microliter of the same matrix solution was added to the spots and air-dried. For FA heating, cell pellets were prepared, and 100% ethanol was added as described above. Then, 0.5 g of 0.1-mm autoclaved zirconia beads was added, followed by bead-beating using the FastPrep-24 instrument for 60 s at a speed of 6 m/s 3 times. After centrifugation at 10,000 × g for 5 min, the ethanol was discarded, and the cell pellets were dried at room temperature. Ten microliters of 70% formic acid was added, and the mixture was homogenized. Then, the solution was heated at 50 °C for 5 min. Ten microliters of 100% acetonitrile was added and mixed. After centrifugation at 10,000 × g for 2 min, 1 µL of the supernatant was spotted onto MALDI plates and air-dried. One microliter of the same matrix solution was added to the spots and again air-dried. For performing simple FA heating, cell pellets were prepared, and 100% ethanol was added as described above. After centrifugation at 10,000 × g for 5 min, the ethanol was discarded, and the cell pellets were dried at room temperature. Ten microliters of 70% formic acid was added, and the mixture was homogenized. Then, the mixture was heated at 50 °C for 5 min. Ten microliters of 100% acetonitrile was then added and mixed. After centrifugation at 10,000 × g for 2 min, 1 µL of the supernatant was spotted onto MALDI plates and air-dried. One microliter of the same matrix solution was added to the spots and air-dried.

MALDI-TOF MS measurement

MALDI-TOF MS analysis was performed using an AXIMA Performance (Shimadzu/Kratos) mass spectrometer equipped with a pulsed nitrogen UV laser (337 nm) in the positive ion linear mode. Unless specified, at least four mass spectra were acquired for each sample from independent sample spots in the range of m/z 2000–20,000. External mass calibration was carried out using Escherichia coli K12 (CCUG 58987) cell homogenates prepared based on the TFA-beating method. Protein mass peaks for 47 ribosomal proteins, ranging from m/z 4365.34 [M + H]+ to m/z 19888.91 [M + H]+, were used with the calibration function in Shimadzu Biotech Launchpad software (v2.8). For a single MALDI spot, 121 profiles (for each, 5 shots were accumulated) were obtained at different locations within the spot and averaged to generate an output profile. A laser power of 100–110 was used with the option “Pulsed Extraction Optimized at (DA)” of 15,000.0. After measurement, the peak list, which contained a list of detected mass peaks (m/z) and signal intensities, was exported for each MALDI spot using the “Export/ASCII” function in the software.

External MALDI-TOF MS spectrum data

The MALDI-TOF MS peak lists for Cutibacterium acnes strains (118 peak lists for 24 different strains) were obtained in a previous study using an AXIMA Performance (Shimadzu/Kratos) mass spectrometer [45]. Peak lists of Acinetobacter strains (616 peak lists for 74 different strains, SAC001) were provided by NBRC, NITE, Japan.

Database construction

Publicly available bacterial and archaeal genomes, including single-cell genomes (SAGs) and MAGs, were downloaded from NCBI’s FTP site at the time of RefSeq/GenBank 95 release [14]. The same quality control criteria used for GTDB construction [2] were applied to exclude some potentially low-quality genomes from the reference genome dataset. CheckM v1.1.2 [46] was used to estimate genome quality and assembly statistics. For all the genomes, gene identification was performed with Prodigal v2.6.3 [47] with automatic estimation of the best translation table for each genome using biolib v0.1.6 (https://github.com/dparks1134/biolib); translated amino acid sequences for all predicted protein-coding genes in a genome were used for further study. Genomes that were suggested to contain an exceptionally high number of genes per unit of genome size were excluded as an additional quality control, where genomes with a number of predicted genes per coding base in a genome > 0.0018 were flagged. Flagged genomes, except those acting as representative genomes at the species level in the GTDB r95, were removed, resulting in 193,197 genomes (190,160 bacterial and 3037 archaeal genomes) being included in the mass protein database.

For all the retained genomes, associated GTDB and NCBI taxonomies were obtained from metadata files of GTDB r95. Predicted protein sequences for all the genomes were annotated by (i) comparison against UniProKB release 2019_09 [48] using DIAMOND v0.8.36 [49] with default settings, and (ii) annotation of proteins was also performed based on Pfam 27.0 [50] using pfam_search (https://github.com/Ecogenomics/pfam_search) with HMMER v3.1b2 [51]. Ribosomal proteins were identified based on UniProKB annotation strings and the set of Pfam models (Additional file 1: Supplementary Material 1). Resulting mature protein sequences during translocation across the cell membrane were inferred using SignalP v5.0 [52], with the command line flags “-org arch” for archaeal genomes or with both options “-org gram-” and “-org gram + ” for bacterial genomes. Original and mature protein sequences were merged. “Methionine loss” was considered if the first amino acid at the N-terminus was “M” and the second was “G'”, “A”, “S”, “P”, “V”, “T” or “C” [53]. The mass of the resulting proteins was estimated based on the average [M + H]+. The list of mass [M + H]+ values for all the predicted proteins from each genome was generated, only the proteins in the range of 2000 to 15,000 Da (equivalent to their expected mass-to-charge ratio, m/z, in MALDI-TOF MS measurements) were retained and stored as “theoretical protein mass lists.” The list of mass values for potential ribosomal proteins for each genome was also generated and stored separately. The prediction procedure for mass values was written in part of the GPMsDB-tk (GPMsDB-dbtk) scripts developed in this study.

Simulated protein mass peaks

Simulated protein mass lists were generated using Python’s “random” function from the theoretical protein mass lists (in the range of m/z 2000–15,000) for Escherichia coli (GCF_003697165.2), Acinetobacter calcoaceticus (GCF_000368965.1), Bacillus subtilis (GCF_000009045.1), Cutibacterium acnes (GCF_003030305.1), and Methanothermobacter thermautotrophicus (GCF_000008645.1). For each genome, mass values were randomly selected to obtain 50, 100, and 200 peaks per list. In addition, “noisy” mass values were generated using the “random.uniform(2000,15000)” function in python; the random noisy values were mixed with selected theoretical mass values at 0, 20, 40, 60, 80% of the noisy values in the total number of mass values, referred to as “noise (%)”. For each condition, 100 random lists were generated, and peak matching was evaluated at a mass tolerance of 200 ppm.

Data analysis

Matching of protein mass peaks was performed using GPMsDB-tk (v1.0.1, including GPMsDB-dbtk v1.0.1 for building custom databases for user-provided genomes), developed as part of this study. GPMsDB-tk is a software toolkit for assigning taxonomic identification to user-provided MALDI-TOF mass spectrometry profiles. Details of the functionality available in the GPMsDB-tk are shown in Additional file 1: Fig. S7. GPMsDB-tk is a Python package that can be run on a regular laptop using ~ 5 Gb of RAM for a single thread. Peak lists (obtained in this study or from other studies) were imported into the toolkit. Peak matching was performed at a mass tolerance of 200 ppm, unless otherwise specified.

To assign of peak lists to their best match in the database, we defined a peak matching (PM) score based on the number of matched peaks between a queried MALDI-TOF MS peak list and theoretical peak list, as follows:

Scoring scheme I:

where the numerator represents the number of matched mass peaks between the measured query peak list and theoretical reference peak list at a given error tolerance for peak matching, and the denominator represents the number of mass peaks in the theoretical reference peak list. Note that the PM score cannot exceed 1 because mass peaks in the query matching multiple peaks in the reference, and vice versa, are counted only once.

Scoring scheme II:

$$if\ \#\left(peaks\in R\right)\le 800$$

$$if\ \#\left(peaks\in R\right)>800$$

where the denominator is set to 800 (that is, the average number of mass peaks per genome in the database) for reference genomes/peak lists with more than 800 masses.

Scoring scheme III:

$$if\ \#\left(peaks\in R\right)\le 800$$

$$if\ \#\left(peaks\in R\right)>800$$

where w represents the weighting factor for matched mass peaks from ribosomal proteins (set to a default value of 7).

As default, the toolkit uses the scoring theme III; this scoring is used for the identification of the cultured strains, including cultures from the fecal samples. Unless specified, for taxonomic identification based on MALDI-TOF MS measurements, self-mass calibration in the GPMsDB-tk (option -aa) and -m option (-m 0) were used with the option “reps” (reference data with representative genome entries at the species level), “all” (reference data with all genome entries), or “custom” (reference data with all genome entries plus user-provided custom genome data) with the option “peak_bwf” with schoring theme III (“–score_type weighted”). GPMsDB-tk provides not only score values for hit genomes but also probability value for each score. For the calculation of the probability value, 100 non-hit genomes, where 500 or 5000 top hit genomes were excluded for selection for “reps” and “all” genomes, respectively, are randomly selected, and the score is calculated for each genome selected. After collecting the scores from the 100 genomes, the distribution of scores for the given peak list is inferred using “stats.norm.sf” function of python and the frequency of appearance of the given score is calculated as the probability value. For comparative genome analyses, average nucleotide identity (ANI) was calculated with FastANI v1.33 [54] with default settings. Mash v2.3 [15] was used to calculate mash distances with all combinations of the retained 193,197 genomes. Data visualization was performed using seaborn/matplotlib in Python. Phylogenetic trees were generated primarily using GTDBtk v1.4.1 [55] with the r95 database. The GTDBtk “align” function was used with the option “–skip_trimming” to export the amino acid sequence alignment, and the sequences were imported into an ARB database [56]. Then, the sequences of selected operational taxonomic units (OTUs) were exported from ARB with custom masks to eliminate noninformative sites of sequences for phylogenetic inference. FastTree v2.1.10 [57] was used for inferring phylogeny and trees generated in ARB.

留言 (0)

沒有登入
gif