SQANTI-SIM: a simulator of controlled transcript novelty for lrRNA-seq benchmark

SQANTI-SIM pipeline

The SQANTI-SIM pipeline is implemented in Python and makes use of R Core Team [31] for specific functionalities such as short-read simulation and evaluating reconstructed transcriptomes. SQANTI-SIM is a wrapper script (sqanti-sim.py) consisting of distinct modules that perform tasks in the following order: (i) the classif stage screens the provided reference GTF, classifying annotated transcripts based on their potential SQANTI3 structural category compared to other transcripts of the same gene; (ii) the design generates a reduced GTF annotation file, excluding novel isoforms according to user-specified novelty types and amounts; (iii) the sim step runs NanoSim, PBSIM3 or IsoSeqSim to simulate reads based on the complete reference annotation, using the expression value distributions provided by the user; and the (iv) eval module assesses the accuracy of the long-read reconstructed transcriptome by running SQANTI3 for the predicted transcripts using the reduced GTF and an additional transcript index file containing the structural annotation of the simulated novel transcripts. For convenience, the classif, design and sim modules can be run in a single command called full-sim, facilitating full dataset simulation. We also provide pre-trained models and characterized datasets generated using the WTC11 human cell line that can be directly used without the need for additional training or user-provided files.

All results presented in this manuscript were generated using version 0.2 of SQANTI-SIM. For a more detailed explanation of SQANTI-SIM, please refer to our GitHub repository: https://github.com/ConesaLab/SQANTI-SIM.

SQANTI-SIM classification

The classif module categorizes reference transcripts using the same classification algorithm as in the latest SQANTI3 version (v5.1.2. at the time of writing this work), providing consistent transcript structural classification across SQANTI3-based tools. Each transcript is assigned a structural category when compared to other transcripts of the same gene. We exclude self-comparisons which would always result in FSM.

SQANTI-SIM classif generates an index file containing reference transcript identifiers, their potential structural categories with the corresponding associated transcript, the associated gene, and other structural features such as length and number of exons. The classification of transcripts into potential SQANTI3 categories is the basis for the selection of novel transcripts for simulation according to user preferences. The resulting file is utilized as input for the subsequent SQANTI-SIM modules.

SQANTI-SIM novelty selection

SQANTI-SIM design performs two main tasks: generating a reduced transcriptome reference annotation and setting expression levels. Reference transcripts are classified as “novel” or “known” based on user-defined requests for each structural category. “Novel” transcripts are removed from the annotation, resulting in a reduced GTF file, while retaining the associated reference gene/transcript information to preserve the previously assigned structural category. Transcripts chosen for removal, simulated as novel, maintain their assigned structural category by preserving the associated reference transcript within the reduced annotation. Meanwhile, the transcripts not deleted will be simulated as known and thus classified as FSM. By default, SQANTI-SIM design targets only transcripts with a minimum length of 200 bp to avoid simulating small RNAs.

After selecting transcripts for simulation, expression values are assigned. SQANTI-SIM provides three alternatives to setting expression values. In the equal mode, the same expression value is assigned to all simulated transcripts, regardless of their novelty type. The custom mode allows for the customization of separate negative binomial distributions for known and novel transcripts, to introduce differences in expression levels between the two transcript types. The sample mode obtains expression values from a real expression distribution profiled from an existing long-read RNA-seq sample. Minimap2 [32] is used to obtain primary alignments to the reference transcriptome that is generated from the genome fasta file and the reference GTF using the gffread tool [33]. Inverse transform sampling is applied to simulate values from the empirical distribution. Minimap2 is run with different presets for Pacific Biosciences (PacBio) reads (“-x map-pb”) and for Oxford Nanopore Technologies (ONT) reads (“-x map-ont”). To maintain the possibility for different expression levels between known and novel transcripts while using the sample mode, the –diff_exp parameter is used in the random sampling process to control the bias between the expression distributions of novel and known transcripts. The –diff_exp parameter influences the odds of assigning high expression values to these two types of transcripts, with higher values resulting in a stronger bias towards known over novel transcripts. Additionally, relying on the primary long-read alignments, SQANTI-SIM roughly estimates the number of transcripts expressed for each gene. The –iso_complex parameter forces SQANTI-SIM to simulate the diversity of transcripts observed, reproducing the isoform complexity of the sample.

Pre-computed empirical cumulative distribution functions (ECDFs) of expression values are provided by default for cDNA and dRNA ONT samples and a cDNA PacBio sample from the WTC11 cell line.

SQANTI-SIM simulation

The SQANTI-SIM sim tool simulates reads from transcripts that were selected at the design stage. For long-read simulation, NanoSim, PBSIM3, or IsoSeqSim are used as detailed in Additional file 3: Section 1. Users can either provide their own training data or used SQANTI-SIM pre-trained models created using the human WTC11 cell line data from the LRGASP project.

NanoSim v3.1.0 is employed for simulating cDNA and dRNA ONT reads. The execution of NanoSim includes the options “-b guppy –no_model_ir” and controls the number of simulated reads by disabling the simulation of randomly unaligned reads. SQANTI-SIM simulates cDNA PacBio reads using either PBSIM3 v3.0.0 (–pbsim) or IsoSeqSim v0.2 (–isoseqsim). PBSIM3 is set to simulate the generation of CLR by multi-pass sequencing using the error models constructed from the provided PacBio reads, and ccs [34] is executed with default parameters to generate HiFi reads. For IsoSeqSim, normal mode is employed, utilizing the 5′ end and 3′ end completeness information from the provided pre-trained PacBio Sequel model, along with the user-defined error rates.

Additionally, SQANTI-SIM simulates matching complementary data such as short-read Illumina data and CAGE-peak data. Illumina RNA-seq reads are simulated using the Polyester R package [35], and expression values are assigned using the same TPMs as for long-read data, resulting in two FASTA files with paired-end reads of 100 nt in length. These simulated reads are generated using an error model uniformly distributed at an error rate of 0.5%.

When used in sample mode, SQANTI-SIM provides a data-driven approach for simulating CAGE-peak orthogonal data based on short-read expression (Additional file 2: Fig. S3). SQANTI-SIM starts by estimating CAGE-peak length and determining the distance from each CAGE-peak center to the nearest transcript TSS using the user-provided CAGE-peal BED file. This results in two separate empirical bivariate distributions (one for length and another for distance) for CAGE peaks that either overlap or do not overlap a transcript TSS.

To predict the presence or absence of a CAGE-peak supporting a given transcript TSS, SQANTI-SIM fits a logistic regression model. The first step involves computing the TSS ratio SQANTI3 metric and the proportion of TSS coverage using BEDTools. The TSS ratio is the ratio of short-read coverage 100bp downstream and upstream of the TSS defined in the SQANTI3 software [19]. The proportion of TSS coverage is computed by considering a 20-bp window downstream of the TSS and dividing by the total number of reads. These values, along with information on the presence or absence of a CAGE peak supporting each transcript TSS (\(Y_i\) for each transcript \(i = 1, ..., n\), where n represents the total number of transcripts), are used to build the logistic regression model as follows:

$$\begin Y_i & \sim Ber(\pi _i), \quad i = 1, ..., n, \nonumber \\ \text (\pi _i) & = \text \left( \frac \right) = \beta _0 + \beta _1 X_i^ + \beta _2 X_i^, \end$$

(1)

where \(\pi _i\) is the probability of having a CAGE peak overlapping the transcript TSSi position, and \(\pi _i\) is linked to the linear predictor by the logit link function. The linear predictor consists of an intercept term \(\beta _0\) and coefficients \(\beta _1\) and \(\beta _2\), which correspond to the log-transformed TSS ratio \(X_i^\) and the log-transformed proportion of TSS short-read coverage \(X_i^\), respectively.

The user can specify the proportion of CAGE peaks that should not align with any TSS or take this from the empirical data. Using the fitted logistic regression model and random sampling from the empirical cumulative distribution functions (ECDFs) of the bivariate distribution, SQANTI-SIM generates a BED file with the simulated CAGE peaks. SQANTI-SIM includes pre-trained models for CAGE-peak simulation obtained with the LRAGSP human WTC11 cell line data. This model achieved 81.2% accuracy in a 10-fold cross-validation

SQANTI-SIM performance evaluation

The SQANTI-SIM eval module is used to assess the accuracy of the transcriptome reconstruction pipeline employed for identifying simulated transcripts. The evaluated transcriptome reconstruction algorithm should predict transcript models using the simulated data and the reduced reference annotation generated by SQANTI-SIM.

The eval step runs SQANTI3 to determine the structural categories of the reconstructed transcripts, utilizing the reduced annotation. Subsequently, SQANTI-SIM evaluates the accuracy of the reconstruction methods by comparing splice junctions and 3′/5′ transcript ends of the reconstructed transcript models with those present in the simulated transcripts. The potential structural categories and structural features of these simulated transcripts are stored in the SQANTI-SIM index file. SQANTI-SIM reports various statistics related to the accuracy of the retrieved data. Transcripts that were removed from the reference annotation should be identified as novel, and any new transcript not simulated by this controlled procedure is considered a false call. Sensitivity (Sn) and precision (Pr) are calculated at the isoform level for both novel and known transcripts and for each structural category separately:

$$\begin \text = \frac} + \text } \qquad \text = \frac} + \text }, \end$$

(2)

where true positive (TP) represents the reconstructed transcript models that match the simulated reference transcript at all splice junctions and have 3′/5′ ends within 50 nt from the annotated TSS and transcription termination site (TTS), false positive (FP) are transcripts that were detected but not simulated, and false negative (FN) indicates transcripts that were simulated but not identified according to the criteria for TP.

SQANTI-SIM also provides F1 scores as a summary measure of accuracy:

$$\begin \text _ \text = \frac^ + \text ^} = \frac} + \frac(\text + \text )}. \end$$

(3)

The evaluation report also includes metrics to evaluate partial matches. Partial true positives (PTP) are transcript models that match a reference transcript at all splice junctions but have a 3′/5′ end located more than 50 nts away from the TSS and/or TTS. Consequently, SQANTI-SIM calculates the positive detection rate (positive detection rate (PDR)) and false detection rate (false discovery rate (FDR)) as follows:

$$\begin \text = \frac + \text } + \text } \qquad \text = \frac} + \text + \text }. \end$$

(4)

For comprehensive characterization of the TP, FN, and FP calls, the SQANTI-SIM report included additional plots describing other structural features. These include transcript length, number of exons, number of simulated reads, and the number of transcripts with canonical or non-canonical junctions. Furthermore, provided with simulated orthogonal data, SQANTI-SIM eval generates descriptive plots, including visualizations of SJ coverage for TP and FP transcripts using short-read data, as well as the TSS support by CAGE peaks.

For an example, of the SQANTI-SIM evaluation report, we refer to https://github.com/ConesaLab/SQANTI-SIM/blob/main/example/example_SQANTI-SIM_report.html.

Real data characterization and dataset simulationPre-trained models and empirical distributions for WTC11 in SQANTI-SIM

SQANTI-SIM includes the default pre-trained models and empirical distributions utilized by the long-read simulators (NanoSim, PBSIM3, and IsoSeqSim), although users have also the option to define custom distributions. For the long-read simulator models, PBSIM3’s default choice is a quality score model derived from PacBio RS II reads provided by PBSIM3 developers. By default, IsoSeqSim is used with the error rates recommended in the IsoSeqSim GitHub repository, which are based on PacBio Sequel data (substitution 1.731%, deletion 1.090%, and insertion 2.204%). For NanoSim, we provide specific profiles and pre-trained models for cDNA-ONT (ENCODE accession ENCFF338WQL) and dRNA-ONT (ENCODE accession ENCFF155CFF) reads acquired from the human WTC11 cell line sequencing data generated by the LRGASP project [16] (see Additional file 3: Section 2.1). The average error rates for these pre-trained models are 8.1% for cDNA (mismatch 2.8%, insertion 1.9% and deletion 3.5%) and 12.2% for dRNA (mismatch 3.6%, insertion 3% and deletion 5.7%).

For expression value assignment at the design stage, SQANTI-SIM offers default empirical distributions of raw counts for various read types. Raw data associated with these distributions can be accessed through the ENCODE database, including cDNA-PacBio (accession ENCFF338WQL), cDNA-ONT (accession ENCFF263YFG), and dRNA-ONT (accession ENCFF155CFF). To obtain these empirical expression distributions, reads were mapped to the reference transcriptome using minimap2 v2.26 (see Additional file 3: Section 2.2), and primary alignments were computed as raw counts.

Additionally, SQANTI-SIM provides pre-trained models for simulating CAGE peaks. To create the models and profiles for CAGE-peak data, SQANTI-SIM requires long-read-defined transcript models and sample-specific short-read and CAGE peak data. SQANTI-SIM uses data from the WTC11 cell line available at the ENCODE database, with cDNA-ONT reads under experiment accession ENCSR539ZXJ, PacBio under accession ENCSR507JOF, short reads under ENCSR673UKZ, and CAGE peaks from GEO accession GSE185917. Transcript models were generated using IsoSeq v4.0.0 and FLAIR v2.0.0 [10], with assistance from matched short-read data (see Additional file 3: Section 2.3.1). Short reads were aligned using STAR v2.7.10b [36] to obtain TSS coverage and the TSS ratio of reconstructed transcript models. Downloaded CAGE-peak calls from GEO were filtered to include only peaks found in at least 2 replicates, resulting in 139,285 CAGE peaks. CAGE peak data and model fitting were characterized using the supplementary script sqanti-sim.py in train mode (see Additional file 3: Section 2.3.2).

PacBio and ONT datasets simulation

SQANTI-SIM validation datasets

For validating the SQANTI-SIM pipeline, we simulated cDNA ONT and PacBio reads including transcript models for all the SQANTI3 structural categories. A total of 50,000 isoforms were simulated, with 1000 assigned to each novel structural category (Additional file 1: Table S1). The simulation was executed using SQANTI-SIM in full-sim sample mode with default –long_count parameter. We used the –iso_complex parameter to validate the simulation of the number of simulated transcripts per gene (see Additional file 3: Section 3). Short-read data was simulated with various sample sizes (20M, 40M, and 60M short reads) using Polyester. The 20M short-read dataset was used to simulate CAGE-peak data.

Data simulation for pipeline benchmarking

The utility of SQANTI-SIM as a benchmarking tool was demonstrated by assessing the performance of different transcriptome reconstruction algorithms and pipelines. For this assessment, we simulated a cDNA ONT and PacBio dataset consisting of known isoforms (FSM) and novel transcripts falling into the three main SQANTI3 structural categories (ISM, NIC, and NNC) (see Additional file 3: Section 4.1).

Datasets were simulated using the GENCODE H. sapiens (GRCh38.p13) reference genome and GENCODE annotation v43. For both the PacBio and ONT datasets, SQANTI-SIM was executed in –full-sim sample mode, generating 50,000 different isoforms, out of which 15,000 were novel (5000 ISM, 5000 NIC, and 5000 NNC). To simulate the ONT dataset, we used the parameters “–ont –illumina –long_count 20000000 –short_count 40000000 –read_type cDNA”, which resulted in the simulation of 20 million ONT long reads and 40 million Illumina reads. For the PacBio dataset, we used the parameters “–pb –pbsim –illumina –long_count 4000000 –short_count 40000000” to simulate 4 million PacBio reads with PBSIM3 and 40 million Illumina reads. We set the option “–iso_complex” to approximate the isoform complexity of the profiled real data, and “–diff_exp 2” to simulate a bias of lower expression for novel transcripts compared to known transcripts. We used the default provided pre-trained models and ECDFs from the WTC11 cell line in the simulation.

Long-read-defined transcriptome reconstruction methods

The performance of IsoSeq, IsoSeq+SQ3Rules filter, FLAIR, FLAIR+Illumina, and TALON was evaluated with the simulated datasets described in the previous section.

As IsoSeq [12] is designed to identify isoforms from reads sequenced on the PacBio sequencing platform, only the PacBio simulated data was processed with IsoSeq. The simulated HiFi reads were clustered using the isoseq cluster2 mode. Then, reads were mapped to the reference genome using pbmm2 v1.12.0 with the ISOSEQ preset. Finally, the alignments were collapsed using isoseq collapse with –do-not-collapse-extra-5exons option.

For the IsoSeq+SQ3Rules pipeline, the SQANTI3 v5.1.2 Rules filter was applied to the IsoSeq transcript models with the following parameters: Isoforms flagged for intrapriming or RT-Switching were discarded. FSM and ISM isoforms were required to have TSS support either from CAGE-peak overlap, having a TSS ratio of 1.5, or close proximity to annotated TSS (< 50 bp). Additionally, other structural categories also required all splice-junctions to be supported by at least 2 short-reads or to be canonical junctions.

FLAIR [10] was applied to process both the ONT and PacBio datasets. Since FLAIR supports short-read splice data, it was run in two modes: (1) using only long reads and the “reduced” GTF (FLAIR pipeline) and (2) using long reads, the “reduced” GTF, and short-read splice sites for correction (FLAIR+Illumina pipeline). Short reads were mapped using STAR, and splice junctions with fewer than 3 supporting short reads were filtered from the resulting SJ.out.tab file. FLAIR was executed in the 123 mode, which performs alignment, correction, and collapse in a single step, using the “–check_splice” parameter. Additionally, splice junction data was provided using the “–short_reads” parameter when incorporating short-read simulated data.

Finally, TALON [11] was run on the ONT and PacBio mapped reads, guided by the “reduced” GTF. For the TALON pipeline, a database with the “reduced” GTF was generated using “talon_initialize_database” with the default settings. Then, the reads were mapped to the reference genome using minimap2 (-ax splice:hq -uf –MD), and “talon_label_reads” was used with the default settings to flag reads for internal priming. Next, the “talon” script was run to annotate the reads and novel transcript models were filtered using “talon_filter_transcripts”. Finally, the reconstructed transcriptome was built using “talon_create_GTF”.

Code details for all pipelines are provided at the Additional file 3: Section 4.2.

留言 (0)

沒有登入
gif