At present, the computational search for biologically active compounds is a necessary stage in drug development [1]. SAR (Structure-Activity Relationship) methods are widely applied in this area [2,3,4,5]. The common task is to predict the interaction between the studied ligands and target biomolecules. A SAR model for a given target must include a sufficient number of ligands to provide reliable training and validation. This is not always feasible, for instance, when analyzing the molecular targets of a newly identified pathogenic microorganism. An approach known as proteochemometrics (PCM) allows one to overcome this limitation. In PCM, target proteins are represented by their own descriptors alongside the ligand descriptors, which yields a single model for the whole dataset. Because the training set for a PCM model is larger than the training set of each individual SAR model, it is intuitive that PCM might have some advantage over SAR in prediction quality.
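The difference between the two data layouts can be illustrated with a minimal sketch. All names and descriptor values below are hypothetical toy data, not taken from the study; real descriptors would be computed from ligand structures and protein sequences.

```python
# Hypothetical toy descriptors (assumed values, for illustration only).
ligand_desc = {"lig_a": [0.1, 0.7], "lig_b": [0.4, 0.2], "lig_c": [0.9, 0.5]}
protein_desc = {"prot_x": [1.0, 0.0, 0.3], "prot_y": [0.2, 0.8, 0.6]}

# Known interactions as (ligand, protein, activity label).
interactions = [
    ("lig_a", "prot_x", 1),
    ("lig_b", "prot_x", 0),
    ("lig_a", "prot_y", 0),
    ("lig_c", "prot_y", 1),
]

# SAR layout: one training set per target, built from ligand descriptors only.
sar_sets = {}
for lig, prot, label in interactions:
    sar_sets.setdefault(prot, []).append((ligand_desc[lig], label))

# PCM layout: one training set for all pairs; each row concatenates
# the ligand descriptors with the protein descriptors.
pcm_set = [
    (ligand_desc[lig] + protein_desc[prot], label)
    for lig, prot, label in interactions
]

for prot, rows in sar_sets.items():
    print(prot, "SAR training rows:", len(rows))
print("PCM training rows:", len(pcm_set))
```

In this toy case each SAR set holds two rows, while the single PCM set holds all four, which is the size argument made above.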
The PCM methods are characterized by a wider applicability domain compared to SAR. Virtual screening of protein-ligand interactions can be performed under four scenarios, defined by how the interaction matrix is filled; its rows correspond to ligands and its columns to targets (Fig. 1) [6].
S0: Search for new interactions where each ligand and each protein already has some known interactors. Each row and each column thus contains some filled cells, and the researcher aims to fill the remaining empty cells.
S1: Prediction of the activity of a new ligand towards targets with known ligand spectra. The row of the new ligand is empty at the beginning and should be filled by the end of the study.
S2: Prediction of the activity of a new protein towards ligands with known target spectra. The column of the new protein is empty at the beginning and should be filled by the end of the study.
S3: Prediction of the interaction of a new ligand and a new protein. The cells at the intersection of the new ligands' rows and the new proteins' columns are empty at the beginning and should be filled by the end of the study.
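The four scenarios can be expressed as a simple rule on whether the ligand and the protein of a queried pair already occur in the training data. The sketch below is illustrative only; the function name `scenario` and the toy identifiers are assumptions introduced here.

```python
def scenario(pair, train_pairs):
    """Label a (ligand, target) pair as S0-S3 relative to the training pairs."""
    ligand, target = pair
    known_ligands = {lig for lig, _ in train_pairs}
    known_targets = {tgt for _, tgt in train_pairs}
    if ligand in known_ligands and target in known_targets:
        return "S0"  # both components already have known interactors
    if target in known_targets:
        return "S1"  # new ligand, known target
    if ligand in known_ligands:
        return "S2"  # known ligand, new target
    return "S3"      # new ligand and new target

# Hypothetical training pairs (filled cells of the interaction matrix).
train = [("lig_a", "prot_x"), ("lig_b", "prot_x"), ("lig_a", "prot_y")]

queries = [
    ("lig_b", "prot_y"),
    ("lig_c", "prot_x"),
    ("lig_a", "prot_z"),
    ("lig_c", "prot_z"),
]
labels = {q: scenario(q, train) for q in queries}
for q, s in labels.items():
    print(q, s)
```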
It should be noted that scenarios S2 and S3 can be implemented only with PCM modeling. The S1 scenario is typical for SAR models based on the structural description of ligands, with protein identifiers as class-forming features. At the same time, several researchers use PCM methods for the S1 scenario, describing both interacting components, without providing convincing arguments for the advantage of PCM over SAR [7,8,9]. Some studies explicitly claim that PCM has an advantage over SAR [8,9]. Many researchers do not specify the validation scheme, which makes it impossible to evaluate predictive performance under the different prediction scenarios [10,11,12,13,14,15]. For this reason, it is especially important to develop a validation scheme that allows a correct comparison of SAR and PCM model performance.
Currently, several techniques are used to validate PCM models. The most widely used are k-fold cross-validation [7,9,10,11] and Venetian blinds cross-validation [12,13,14,15]. These methods partition the entire set of protein-ligand pairs into training and test subsets without adhering to any of the above-mentioned virtual screening scenarios. They do not guarantee that each studied protein and each studied ligand remains in the training set at every fold. Cross-validation is sometimes combined with an external test set that is set aside from the training set before hyperparameter selection [9,10]. This data set is used to demonstrate the relative universality of the selected model hyperparameters.
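A small sketch can show why pair-level partitioning does not correspond to a scenario such as S1. It assumes a toy, fully filled matrix of 6 ligands and 4 targets and a Venetian blinds assignment in which every fifth pair goes to the same fold; these sizes and names are illustrative assumptions, not settings from the cited studies.

```python
# Toy data: every ligand paired with every target, ordered ligand by ligand.
ligands = [f"lig_{i}" for i in range(6)]
targets = [f"prot_{j}" for j in range(4)]
pairs = [(lig, tgt) for lig in ligands for tgt in targets]

K = 5  # number of folds (assumed)

# Venetian blinds: pair number i is assigned to fold i mod K.
fold_of = [i % K for i in range(len(pairs))]

shared_fraction = []
for fold in range(K):
    test = [p for p, f in zip(pairs, fold_of) if f == fold]
    train = [p for p, f in zip(pairs, fold_of) if f != fold]
    train_ligands = {lig for lig, _ in train}
    # Share of test pairs whose ligand is also present in the training data.
    seen = sum(1 for lig, _ in test if lig in train_ligands)
    shared_fraction.append(seen / len(test))

print(shared_fraction)
```

Under these assumptions every test pair has its ligand in the training subset, so no fold tests the model on a truly new ligand as scenario S1 would require.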
The LOO (Leave-One-Out) methods hold out, at each validation fold, all pairs that contain the same component. The LOCO (Leave-One-Compound-Out) technique [7,8] forms the test set from all pairs containing the excluded ligand, and the model is trained on the remaining data. This method corresponds to the S1 scenario of virtual screening. The LOTO (Leave-One-Target-Out) technique [7,8,10] is similar to LOCO, but targets are excluded instead, which corresponds to the S2 scenario. The LOO validation guarantees the representation of all proteins and ligands at each fold. The LOO methods are computationally expensive on large data sets, since they require a separate model for each ligand or target. One solution is to cluster the ligands and exclude all pairs containing ligands of the same cluster. The LOSO (Leave-One-Scaffold-Out) method groups ligands by molecular scaffolds [10,11], while the LOCCO (Leave-One-Compound-Cluster-Out) method clusters ligands by distance in feature space. These techniques correspond to the S1 scenario.
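The LOCO, LOTO and cluster-based splits differ only in how pairs are grouped before one group is held out. The following is a minimal sketch under that reading; the helper `leave_one_group_out`, the toy pairs and the `cluster_of` mapping are hypothetical and stand in for whatever grouping (scaffold or feature-space cluster) a study would actually use.

```python
from collections import defaultdict

def leave_one_group_out(pairs, group_of):
    """Yield (group, train_indices, test_indices), holding out one group per fold."""
    groups = defaultdict(list)
    for idx, pair in enumerate(pairs):
        groups[group_of(pair)].append(idx)
    for group, test_idx in groups.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(pairs)) if i not in held_out]
        yield group, train_idx, test_idx

# Hypothetical (ligand, target) pairs.
pairs = [
    ("lig_a", "prot_x"), ("lig_a", "prot_y"),
    ("lig_b", "prot_x"), ("lig_b", "prot_y"),
    ("lig_c", "prot_x"), ("lig_c", "prot_y"),
]

# LOCO: group by ligand (S1). LOTO: group by target (S2).
loco = list(leave_one_group_out(pairs, lambda p: p[0]))
loto = list(leave_one_group_out(pairs, lambda p: p[1]))

# Cluster-level variant: group by an assumed ligand-to-cluster assignment.
cluster_of = {"lig_a": 0, "lig_b": 0, "lig_c": 1}
locco = list(leave_one_group_out(pairs, lambda p: cluster_of[p[0]]))

print(len(loco), "LOCO folds;", len(loto), "LOTO folds;", len(locco), "cluster folds")
```

With three ligands, LOCO needs three models, whereas the cluster-level split needs only two, which reflects the reduction in computational cost described above.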
In this study, we compared the SAR and PCM models for their predictive performance under the S1 scenario, paying particular attention to the validation procedure suitable for such a comparison.