Development of a predictive model to distinguish prostate cancer from benign prostatic hyperplasia by integrating serum glycoproteomics and clinical variables

Materials

All chemicals used in the experiments described in this section were purchased from Sigma-Aldrich unless otherwise specified.

Sample collection

Blood samples were obtained from the Urology Units of Romolo Hospital (Kr) and Magna Graecia University of Catanzaro. Specimens were collected from PCa before radical prostatectomy and any therapeutic treatment, while patients suffering from BPH were recruited as controls. Inclusion criteria were: prostate biopsy performed at least 4 weeks prior to recruitment (with a minimum of 12 sampling). The exclusion criteria were: previous prostatic surgery, radiotherapy of the pelvis, neoadjuvant anti-androgenic therapy, therapy with 5-alpha reductase inhibitors.

Quantification of proPSA on serum samples was performed by enzyme-linked immunosorbent assay (ELISA) using the MYBioSource kit.

Discovery experiments

Three different discovery experiments were carried out in order to expand as much as possible the list of potential candidates to be validated by PRM. In particular, these multiple discovery experiments were performed starting from the same 40 digested serum samples (20 PCa and 20 BPH) and are referred in the following sections as “Discovery TMT-A”, “Discovery TMT-B” and “Discovery 3D”. Discovery TMT-A and TMT-B are both based on isobaric labelling but differ in where TMT derivatization was performed in the workflow: in TMT-A, labelling was performed after glycopeptide enrichment by TiO2, whereas in TMT-B it preceded the enrichment procedure. On the contrary, Discovery 3D experiments were performed in label free mode to construct a database of MS spectra for verification experiments.

Protein digestion for discovery experiments

In-solution digestion was performed on 40 serum samples (20 PCa and 20 BPH). Briefly, 25 µL of each sample were diluted with 225 µL of 100 mM triethylammonium bicarbonate (TEAB)/2.5% sodium deoxycholate (DOC) (w/v). Then, protein disulphide bonds between cysteine residues were reduced by adding 25 µL of 100 mM dithiothreitol (DDT) and incubating the samples for 5 min at 95 °C, then for 60 min at 37 °C with gentle agitation. Cysteine residues were alkylated by 24 mM (final concentration) iodoacetamide (IAA) for 60 min at 37 °C with gentle agitation. Then, IAA excess was quenched by an extra 2 mM DTT (incubation of 30 min at 37 °C). Finally, 124 µL of each sample, corresponding to 10 µL of undiluted serum, were withdrawn and mixed with 365 µL of 50 mM TEAB in order to lower DOC concentration down to 0.5%. Samples were digested overnight with 8 µg of trypsin using a 1:100 E/S ratio (37 °C at 650 rpm).

Discovery TMT-A

Hundred microliters of each digested specimen (about 160 µg) were pooled in groups of 4 for a total of 10 sample pools (5 PCa and 5 BPH). DOC was removed by precipitation (Additional file 1). Then,

glycosylated peptides were enriched by the use of TiO2 beads following the protocol of Palmisano and co-workers [17].

The 10 sample pools were labelled by Tandem Mass Tags (TMT-10 plex, Thermo Fisher). TMT labelling was performed following the manufacturer’s protocol except for the resuspension volume of TMT reagents, which was 100 µL of anhydrous ACN (final TMT concentration of 0.8 µg/µL) (Additional file 1).

TMT-labelled sample pools were combined in 1:1 ratio into a single sample. This sample was fivefold diluted in Wash B (80% ACN/0.5% formic acid (FA) (v/v)) and then fractionated by strong cation exchange (SCX) StageTip (Additional file 1) [18]. Then, the 10% of each fraction was analyzed by nanoliquid chromatography-tandem mass spectrometry (nLC-MS/MS).

Discovery TMT-B

Twenty-five µL of digested samples (about 40 µg) were pooled in groups of 4 samples for a total of 10 pooled samples (5 PCa and 5 BPH). Subsequently, peptides were labelled as described in Additional file 1. After having verified that the labelling reaction was complete, by injecting a small aliquot of each sample in nLC-MS/MS prior to quenching [19], the labelling reaction was quenched by hydroxylamine. Then, all samples were combined in 1:1 ratio into a single sample mix (about 1.6 mg in a volume of 12 mL).

Labelled peptides were separated from the detergent by acid precipitation followed by solid-phase extraction (SPE) (Additional file 1).

By virtue of a higher quantity of peptide starting material, TiO2 enrichment was performed using 10 mg of beads. Washings and elution were performed as described in Additional file 1, section “Glycopeptide enrichment”. Then, glycopeptides were de-glycosylated by the addition of 6 µL of PNGase F (overnight incubation at 37 °C with gentle agitation).

Formerly glycosylated peptides were separated in 10 fractions by C18 (Empore™-3M, C18) StageTips performed at basic pH (Additional file 1). Then, 25% of each fraction was analyzed by nLC-MS/MS.

Discovery 3D

Discovery 3D experiments were performed to create a database of MS/MS spectra for the subsequent verification phase. The experiments were performed using solely PCa specimens (20 samples), since protein identification was mainly directed towards hits potentially increased in PCa. From each sample digest, obtained as previously described, 175 µL of solution were withdrawn and pooled (total volume was 3.5 mL, total peptide amount 5 mg). DOC was precipitated using TFA 0.5% and the supernatant was withdrawn and purified by SPE HLB (3 cc) as described in Additional file 1 section “High pH C18 fractionation”. After SPE purification, the obtained eluate was lyophilized. The SPE eluate was resuspended in Titanium Loading Buffer and glycopeptides were enriched as described in Additional file 1, section “Glycopeptide enrichment”. In this case, 25 mg of TiO2 beads were used. Finally, enriched glycopeptides were incubated with 10 µL of PNGase F to remove carbohydrate moieties.

Formerly N-glycosylated peptides were fractionated in 10 fractions by Basic pH fractionation as described in in Additional file 1 section “High pH C18 fractionation”, using an increased amount of stationary phase in order to accommodate the higher amount of material. In particular, three different StageTips, each packed with 3 Empore C18 disks were used. Each of the 10 fractions was divided further into 5 additional fractions by SCX using the procedure described in Additional file 1 section “Strong cation exchange (SCX) StageTip”. Fractions 7, 8 and 9 eluted from the basic pH C18 StageTips were combined in a single fraction because of their low peptide content. After this procedure, 40 fractions in total were obtained; 25% of each fraction was processed by nLC-MS/MS.

nLC-MS/MS analysis of discovery experiments

All the fractions from Discovery Experiments were analyzed by tandem mass spectrometry in data-dependent acquisition mode (DDA). Briefly, chromatographic separation was performed by nanoflow chromatography using EASY-LC-1000 instrument (Thermo Fisher) coupled with a Q-Exactive mass spectrometer (Thermo Fisher). Peptides were separated by an in-house made analytical column packed to 14 cm of length with 3 μm C18 silica particles (Dr. Maisch). Gradient elution was obtained using a binary gradient of 140 min at a flow rate of 300 nL/min. The mobile phase A and B were (2% ACN/0.1% FA (v/v)) and (80% ACN/0.1% FA (v/v)) respectively. The percentage of mobile phase B was increased from 0 to 6% in 1 s, then to 38% in 120 min and to 100% in 15 min. After 5 min at 100%, mobile phase B was then decreased to 0% in 2 min. MS detection of peptides gradually eluted from the analytical column, was performed by nanoelectrospray (nESI) applying a potential of 1700 V to the column front-end via a tee piece. DDA was performed by using a top-12 method, where the 12 most abundant ions were automatically selected for HCD fragmentation (collision energy was set at 34% for TMT experiments and at 25% for the Discovery 3D experiment).

Resolution, AGC target and maximum injection time (ms) for full MS and MS/MS were 70 000/35000, (1 × 106)/(2 × 105), 50/120, respectively. MS full scan range was 350 − 1800 m/z. Mass window for precursor ion isolation was 1.6 m/z. Ion threshold for triggering MS/MS events was 1 × 105. Dynamic exclusion was 30 s.

Data analysis of discovery experiments

The raw files from TMT-A experiments were analyzed with Proteome Discoverer (v. 2.1) using Sequest HT as search engine. Search parameters were the following: MS tolerance 15 ppm; MS/MS tolerance 0.02 Da. Trypsin was selected as an enzyme and two missed cleavage sites were allowed. TMT labelling of lysines and N-terminus (+ 229.163 Da), deamidation of asparagines (+ 0.984 Da), and oxidation of methionines (+ 15.995 Da) were set as variable modifications, while carbamidomethylation of cysteines (+ 57.021 Da) was set as fixed modification. Only peptides harboring glycosylation consensus sequence (NXT/S) and fully labelled were kept for subsequent statistical analysis.

Data from TMT-B and Discovery 3D experiments were analyzed with MaxQuant (v. 6.2). The following parameters were used: enzyme trypsin, maximum 2 missed cleavages, MS tolerance 3.5 ppm after recalibration and MS/MS tolerance of 20 ppm. The dynamic modifications were: methionine oxidation (+ 15.995 Da), asparagine deamidation (+ 0.984 Da), TMT labelling of lysines and N-terminus (+ 229.163 Da) (for TMT-B). Carbamidomethylation of cysteines (+ 57.021 Da) was set as static modification.

The Human Uniprot protein sequence database accessed on 15 November 2017 was used as sequence database (20184 entries).

Statistical analysis of Discovery TMT-B data was performed with Perseus software. Protein intensities were log2 transformed and normalized based on the median value of all intensities. Differentially expressed proteins were filtered based on p-value < 0.1 and a fold-change > 1.1 and presence of glycosylation consensus (NX/T).

Results Discovery 3D were analyzed only from a qualitative point of view interpolating the list of identifications with BioGPS (www.biogps.org) and with candidate lists selected from the literature16,18 in order to identify proteins involved in PCa development.

LC-PRM assay development

In this phase, the analytes selected in the discovery experiments were quantified in targeted mode by PRM in label free mode. Proteomic analysis was carried out on 53 specimens (27 BPH, 26 PCa). Serum samples were digested, DOC was precipitated and, then glycopeptides were enriched by TiO2 enrichment and purified by C18 as described in the section relative to the discovery experiments.

PRM Quantification of formerly N-linked glycopeptides

Discovery experiments resulted in a list of 34 formerly N-linked glycopeptides (belonging to 31 proteins) of interest for PRM quantification in label free mode in individual samples. The selected candidates and the relative proteins are illustrated in the Table 1.

Table 1 Candidates tested by PRM and their blood concentration according to the Human Protein Atlas.LC–MS method settings (PRM) and data analysis

Formerly N -linked glycopeptides were analyzed using the method described in Additional file 1 section “LC-PRM acquisition method”.

Data sets were imported into Skyline v. 19.1 and peaks were automatically integrated and manually inspected. For the quantification of the 34 selected peptides (35 precursors, since one peptide was also detected in its oxidized form), MS/MS spectra from 3D Discovery Experiments were used to build a spectral library of TiO2-enriched serum. The charge states of precursors were set to 2, 3 and 4, and the product ions were 1-, 2- and 3-charged (ion types y, b, p) with a up to 6 product ions. The ion match tolerance was set to 0.05 m/z [20].

Preparation of heavy peptides for validation experiment

The Heavy peptides containing either 13C6 + 15N2 atoms (Lys) or 13C6 + 15N4 atoms (Arg) at the carboxy terminal amino acid were bought from JPT Peptide Technologies (Berlin, Germany, Additional file 2: Table S1). These peptides were solubilized in 40% ACN /0.1% FA v/v; for most hydrophobic peptides, 70% ACN instead of 40% ACN was used (Additional file 2: Table S1). To test their purity and to optimize chromatographic conditions, heavy peptides were individually injected in nLC-MS/MS. After the completion of PRM verification experiments, a “heavy” peptide mixture (HPM) matching the expected relative concentrations of endogenous peptides was created. In order to obtain the HPM, concentrated peptide solutions were diluted in 40% ACN/0.1% FA until a concentration value 100-fold higher than what reported in Table 2 Si was obtained; resulting peptide solutions were then mixed together and diluted 100-fold in 10% ACN/0.1% FA. HPM was stored in 10 µL aliquots at -80 °C until its use as internal standard. The concentration of each heavy peptide in the HPM is illustrated in Table S2 (Additional file 2). HPM was added to each sample after protein digestion and N-glycopeptide enrichment, thus before C18 purification and nLC-MS/MS.

Table 2 Principal 11 significant variables after feature selectionVerification experiments

This phase was focused on the analysis by PRM of the selected candidates by nLC-MS/MS in targeted mode by using isotopically labelled peptides as internal standards. This subset of experiments was performed in duplicates on an independent subset of 79 PCa and 84 BPH patients.

Ultimate sample processing workflow and PRM analysis

Samples were processed as described in the discovery experiments, making only minor changes to the original protocol (Additional file 1 section “Sample processing workflow”).

The whole pipeline was carried out in duplicate for 84 BPH and 79 PCa serum samples, accomplishing a total of 326 nLC-MS/MS analyses.

nLC-MS/MS analysis was performed using the acquisition method described in Additional file 1 section “Ultimate LC-PRM acquisition method”.

A schematic view of the PRM method is reported in Additional file 2: Table S3.

Data analysis

The variability of the glycopeptide enrichment procedure was corrected through introducing a normalization factor based on the quantification, by extracted ion chromatogram (XIC), of 30 highly abundant serum glycopeptides (Additional file 2: Table S4). The selection criteria for the 30 glycopeptides used for normalization were: high concentration and no involvement in inflammation.

Sample replicates were evaluated for their concordance. As criterion, the “scaled relative difference” (SRD) introduced by Hyslop and White in 2009 was chosen [21]. The scaled relative difference can be defined by the following formula: (Ci1-Ci2)/Ci√2

where Ci1 and Ci2 represent the sum of all XIC values for the 30 reference glycopeptides in replicates 1 and 2, respectively, whereas Ci is the average of the two measures. SRD higher than 0.50 (or lower than − 0.50), indicating a difference in glycopeptide abundance between the two replicates higher than twofold, was considered not acceptable. In this case, the replicate with the lower recovery of glycopeptides was discarded. Besides, nLC-MS/MS runs having a Ci1 or Ci2 value 2 standard deviations lower than the average Ci in the data set were also excluded (Additional file 3: Table S5). After this preliminary filtering operations, 131 duplicate analyses and 32 single analyses, respectively, were subjected to multivariate analysis.

Multivariate analysis

The calculated areas of light peptides were corrected by the IS signal (heavy peptide) as follows:

Ln’ = f * Ln.

where Ln is the area obtained for the n-th peptide light, Ln’ is the corrected area, and f is the correction factor obtained with the formula below:

f = Hm / Hn.

where Hn is the area of the n-th peptide (heavy form) and Hm is the average value of the n-th heavy peptide in the overall sample set.

After being corrected, the areas of endogenous peptides were normalized using the normalization factor, considering the enrichment efficiency. Peptide areas were divided by: Ci1 (or 2) / Ca, were Ci1 and Ci2 represent the sum of all XIC values for the 30 reference glycopeptides in replicate 1 and 2, respectively, whereas Ca represents the average value obtained for the entire data set. Finally, for each sample having a technical duplicate, the average between the two replicates for each peptide was calculated. The whole data matrix after peptide normalization is reported in Table S6 (Additional file 4).

Clinical variables (Additional file 4: Table S7) together with mass spectrometric results (32 peptide areas) were filtered by feature selection exploiting different approaches. In particular, Random Forest, Chi-square test, Pearson coefficient, Lasso regression and Recursive feature elimination have been used. According to each model’s metrics, the feature selection identifies the most statistically important characteristics and ranks them according to relevance score. The linear correlation between two attributes is measured by the Pearson correlation coefficient [22]. The Pearson correlation coefficient given two random variables X and Y, is the ratio of their covariance to the sum of their respective standard deviations. To test the independence of two events, the Chi-square is used [23]. The test examines the difference between the observed count and predicted count given two factors. The observed count is close to the expected count when two variables are independent, which lowers the Chi-square value. The Recursive Feature Elimination (RFE) [24] is then added to the feature selection module in order to fit the model and eliminate the worst features. By iteratively removing features, the RFE enables to decrease the collinearity that already exists in the supplied data. RFE enables us to recursively reduce features by examining data that show their relative relevance. Random Forest (RF) ensures good data abstraction results also because it is easy to calculate the relative value of each feature on the produced decision tree. Several random decision trees with nodes containing binary questions depending on a single or a combination of features are generated by RF. The tree splits the dataset into two subsets at each node. The effectiveness of each feature, or group of features, in dividing the dataset is then taken into account when determining its relevance. For each test, the maximum number of significant variables has been limited to 20. Table 2 illustrates, in decreasing order of significance, the 11 variables of higher interest.

The list was filtered leaving variables significant for at least 4 of the algorithms (i.e. the first 11 variables) giving the following set: pro-PSA, Free PSA/Total PSA, Gland volume, Total PSA, Free PSA, VQPFNVTQGK (LAMP2), NINYTER (GPLD1), LHINHNNLTESVGPLPK (LUM), DGQLLPSSNYSNIK (NCAM1), SNSSMHITDCR (RNASE1), NNLTTYK (MASP1). Then, the sample set was divided into two groups: 143 samples were used to build the predictive model, whereas the remaining 20 samples were used to evaluate the performance of the model by using a voting strategy [25].

In particular, concerning model creation, 100 out of 143 samples were used as training set (70% of the dataset) and the remaining 43 samples (30% of the dataset) as testing set. The algorithm showing the highest predictive performance (Random Forest), was selected by considering the highest AUC and also the highest sensitivity score, which is one of the most relevant measures in clinical applications, since it gives an idea of the ability of the model to minimize false negatives.

Lastly, to evaluate the performance of the predictive model, ML algorithms results were integrated by implementing a voting strategy. In particular, two voting strategies were employed to merge the outcomes of all the algorithms: hard voting and soft voting. Hard voting counted ML models that agreed on the predicted classes. More specifically, if 4/5 ML models agreed on PCa for a certain input, the hard voting strategy returned PCa as result. Whereas, concerning the soft voting approach, each ML model prediction (i.e., PCA or BPH class) was weighted by the F1 performance measure. The voting strategy was applied for the classification of 20 patients belonging to the diagnostic grey zone (tPSA 4–10 ng/mL).

A separate data analysis focused only on PCa dataset was also conducted to evaluate the possibility to distinguish high grade (AG-PCa) from low grade tumors (NAG-PCa). PCa sample set was divided in two subgroups: 53 AG-PCa (Gleason > 3 + 3) and 26 NAG-PCa (Gleason 3 + 3). The principal contributing variables were triaged by feature selection step. Model testing was performed on 55 samples (70% of the PCa sample set). The selected variables, ranked on their relative contribution to the model, were: FCN3, proPSA, LGALS3BP, AZU1, C6, LAMB1, CHL1, POSTN. The model was tested with the remaining 30% of the sample set (24 data samples).

留言 (0)

沒有登入
gif