Biomedicines, Vol. 11, Pages 67: GraphATT-DTA: Attention-Based Novel Representation of Interaction to Predict Drug-Target Binding Affinity

1. Introduction

Drug development is a high-risk industry involving complex experiments, drug discovery, and pre-clinical and clinical trials. Drug discovery is the process of identifying new candidate compounds with potential therapeutic effects, and identifying drug-target interactions (DTIs) is an essential part of it. Moreover, the drug-target binding affinity (DTA) provides information on the strength of the interaction between a drug-target pair. However, as there are millions of drug-like compounds, the target-to-hit process for a new drug can take years and cost about 24 million US dollars in experimental assays [1]. Efficient computational models for predicting DTA are therefore urgently needed to speed up drug development and reduce resource consumption.

There are several computational approaches to predicting DTA [2,3]. One is the ligand-based method, which compares a query ligand to the known ligands of its target proteins. However, prediction results become unreliable if the number of known ligands for the target proteins is insufficient [4]. Another approach is molecular docking [5], which simulates binding by exploring the conformational space of compounds and proteins based on their three-dimensional (3D) structures. However, producing the 3D protein-ligand complex is very challenging. A third approach is the chemogenomic method [6], which integrates the chemical attributes of drug compounds, the genomic attributes of proteins, and their interactions into a unified mathematical framework.

In feature-based chemogenomic methods, drug-target pairs are taken as input, and their binding strength (regression) or whether they interact (binary classification) is the output [7,8]. Efficient input representation is key to accurate prediction. Commonly used drug descriptors are chemical fingerprints such as the Extended Connectivity Fingerprint [9] or Molecular ACCess System [10]. Commonly used protein descriptors are physicochemical properties such as amino acid composition, transition, and distribution. On these constructed features, random forest, support vector machine, and artificial neural network models are applied to predict interactions [11]. Similarity information is also used for representation [12,13,14,15]. KronRLS [14] constructs drug-drug and target-target similarities from compound similarity and Smith-Waterman sequence similarity. SimBoost [15] constructs features for each drug, target, and drug-target pair from these similarities. However, the fixed lengths of manually selected features may result in a loss of information.

Recently, data-driven features learned from large datasets during training have been shown to improve DTI prediction performance [16,17,18,19,20,21,22,23,24]. DeepDTA [16] learns representations of drugs and proteins with one-dimensional (1D) convolutional neural networks (CNNs). However, this ignores the molecule's original graph structure. To address this, GraphDTA [17] represents a molecule as a graph in which nodes are atoms and edges are bonds; graph neural networks (GNNs) are used for molecular representation, and 1D CNNs for protein representation. Additionally, DGraphDTA [18] represents a protein as a contact map followed by graph convolutional network (GCN) embedding to learn DTA using protein structure.
However, when modeling DTA interactions, these models consider only the global interactions between compounds and proteins.

Furthermore, several studies [20,21,22,23,24] have introduced attention mechanisms to better model the interactions between drugs and proteins for DTA prediction. DeepAffinity [20] introduced an attention mechanism to interpret predictions by isolating the molecular fragments that contribute most to a given pair. ML-DTI [21] proposes a mutual learning mechanism. It takes Simplified Molecular-Input Line-Entry System (SMILES) strings and amino acid sequences as input, and 1D CNNs are used for encoding. It leverages protein information during compound encoding and compound information during protein encoding, resulting in a probability map between a global protein descriptor and a drug string feature vector. MATT-DTI [22] proposes a relation-aware self-attention block that remodels drugs from SMILES data by considering the correlations between atoms; a 1D CNN is then used for encoding. The interaction is modeled via multi-head attention, in which the drug is regarded as the key and the protein as the query and value. HyperAttentionDTI [23] uses a hyperattention module that models semantic interdependencies in the spatial and channel dimensions between drug and protein sub-sequences. FusionDTA [24] applies a fusion layer comprising multi-head linear attention to focus on important tokens in the entire biological sequence; additionally, the protein tokens are pre-trained with a transformer and encoded by bidirectional long short-term memory (Bi-LSTM) layers. Although these studies successfully apply attention mechanisms to DTA prediction, they are limited because they learn from less informative input features that do not consider the essential regions needed to determine interaction affinities [16,20,21,22,23,24].

Therefore, in this paper, we propose GraphATT-DTA, an attention-based drug and protein representation neural network that considers local-to-global interactions for DTA prediction (Figure 1). The molecular graph of the compound and the protein amino acid sequence are the initial inputs. A powerful GNN model is used for compound representation, and 1D CNNs are used for protein representation. The interactions between compounds and proteins are modeled with an attention mechanism that captures the important subregions (i.e., substructures and sub-sequences) so that fully connected layers can predict the binding affinity between a compound and its target protein. We evaluate the performance of our model using the Davis kinase binding affinity dataset and the public, web-accessible BindingDB database of measured binding affinities. GraphATT-DTA's prediction performance is then compared with state-of-the-art (SOTA) global and local interaction modeling methods.

2. Materials and Methods

2.1. Dataset

In this study, our proposed model and the comparison baselines were trained on the Davis dataset [25] and evaluated for external validity on the BindingDB dataset [26]. Table 1 and Supplementary Table S1 provide a summary. The Davis dataset contains the kinase protein family and relevant inhibitors with dissociation constants (Kd), whose values are transformed into log space as

pKd = −log10(Kd / 10^9)

(1)

The BindingDB dataset is publicly accessible and contains experimentally measured binding affinities whose values are expressed in terms of Kd, Ki, IC50, and EC50. For the external test, we extracted drug-target pairs in which the protein is a human kinase and the binding affinity is recorded as a Kd value. These values are then transformed into log space as described above.
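As an illustration, the following minimal Python sketch applies this log-space transformation (Eq. (1)); the function name and example value are ours, not part of the original pipeline.

```python
import numpy as np

def kd_to_pkd(kd_nm):
    """Convert a dissociation constant in nM to pKd (log-space affinity).

    Standard transformation used for the Davis dataset:
    pKd = -log10(Kd / 1e9), i.e., Kd is first converted from nM to M.
    """
    return -np.log10(np.asarray(kd_nm, dtype=float) / 1e9)

# Example: Kd = 10,000 nM maps to pKd = 5.0
print(kd_to_pkd(10000.0))  # 5.0
```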

The Davis dataset consists of six parts: five are used for cross-validation and one is used for testing. We use the same training and testing scheme as GraphDTA. The hyperparameters are tuned on the five parts with five-fold cross-validation. After tuning, we train on all five parts and evaluate performance on the held-out test part. To evaluate the generalizability of the model, BindingDB is used as the external test dataset.
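A minimal sketch of this training/testing scheme is shown below; the fold construction with NumPy and scikit-learn is illustrative (including the number of pairs), not the authors' exact split code.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_pairs = 30056                       # illustrative number of drug-target pairs
indices = rng.permutation(n_pairs)

# Split into six equally sized parts: five for cross-validation, one held out for testing.
parts = np.array_split(indices, 6)
test_idx = parts[-1]
cv_idx = np.concatenate(parts[:5])

# Five-fold cross-validation over the five training parts for hyperparameter tuning.
for fold, (train_pos, valid_pos) in enumerate(KFold(n_splits=5).split(cv_idx)):
    train_idx, valid_idx = cv_idx[train_pos], cv_idx[valid_pos]
    # ... train a candidate configuration on train_idx, validate on valid_idx ...

# After tuning, retrain on all five parts (cv_idx) and evaluate once on test_idx.
```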

2.2. Input Data Representation

GraphATT-DTA takes a SMILES string as the compound input and an amino acid sequence string as the protein input. First, the SMILES string is converted to a graph structure with atoms as nodes and bonds as edges using the open-source Deep Graph Library (DGL) v.0.4.3(2) [27], DGL-LifeSci v.0.2.4 [28], and RDKit v.2019.03.1(1) [29]. We used the atomic features defined in GraphDTA (i.e., atom symbol, number of adjacent atoms, number of adjacent hydrogens, implicit valence of the atom, and whether the atom is in an aromatic structure). We leverage the bond features used by the directed message-passing neural network (DMPNN; i.e., bond type, conjugation, whether the bond is in a ring, and stereo configuration). Table 2 and Table 3 list detailed information for each feature. Each amino acid type is encoded as an integer, and sequences are cut at a maximum length of 1000; sequences shorter than the maximum length are padded with zeros. This maximum length covers at least 80% of all proteins.

2.3. Drug Representation Learning Model

A molecule is naturally represented by a graph structure consisting of atoms and bonds. The GNN uses this structural information and applies a message-passing phase consisting of message_passing and update functions. In the message_passing function, node v aggregates information from its neighbors' hidden representations, h_w^(t). In the update function, the previous hidden representation, h_v^(t), is updated to a new hidden representation, h_v^(t+1), using the message m_v^(t+1) and the previous hidden representation, h_v^(t):

m_v^(t+1) = message_passing({h_w^(t) : w ∈ N(v)})

(2)

h_v^(t+1) = update(m_v^(t+1), h_v^(t))

(3)

where N(v) is the set of neighbors of v in graph G, and h_v^(t) is the hidden representation of v at time step t, with h_v^(0) initialized from the atom features x_v. This mechanism, in which atoms aggregate and update information from their neighbor nodes, captures information about the substructure of the molecule. GNN variants, such as the graph convolutional network (GCN) [30], graph attention network (GAT) [31], graph isomorphism network (GIN) [32], message-passing neural network (MPNN) [33], and directed message-passing neural network (DMPNN) [34], can be leveraged by specifying the message_passing function, m_v^(t+1), and the update function, h_v^(t+1) (see Table 4). The output is a drug embedding matrix, D ∈ ℝ^(Na×d), where Na is the number of atoms and d is the dimension of the embedding vectors. In the drug embedding matrix, each atom carries the information of its neighboring atoms (i.e., its substructure) within a radius given by the number of GNN layers.
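To illustrate Eqs. (2) and (3), the following is a minimal PyTorch sketch of sum-aggregation message passing over an adjacency matrix; it is not one of the GCN/GAT/GIN/MPNN/DMPNN variants actually used (those are specified in Table 4), and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SimpleMessagePassing(nn.Module):
    """Illustrative sum-aggregation message passing (cf. Eqs. (2)-(3))."""
    def __init__(self, dim):
        super().__init__()
        self.update_fn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h, adj):
        # h:   (Na, d) atom hidden states at step t
        # adj: (Na, Na) adjacency matrix of the molecular graph
        m = adj @ h                                          # Eq. (2): sum messages from neighbors N(v)
        h_next = self.update_fn(torch.cat([m, h], dim=-1))   # Eq. (3): update with m and the previous h
        return h_next

# Stacking T such layers lets each atom embedding summarize its T-hop substructure,
# yielding a drug embedding matrix D of shape (Na, d).
atoms, d = 24, 64
h0 = torch.randn(atoms, d)                  # initial atom features x_v (projected to d dims)
adj = (torch.rand(atoms, atoms) > 0.8).float()
adj = ((adj + adj.T) > 0).float()           # symmetric toy adjacency
layer = SimpleMessagePassing(d)
D = layer(layer(h0, adj), adj)              # two message-passing steps
print(D.shape)                              # torch.Size([24, 64])
```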

2.4. Protein Representation Learning Model

The Davis and BindingDB datasets contain 21 and 20 amino acid types, respectively; hence, we consider 21 amino acid types during training and 20 during testing. The integer-encoded protein sequences are the input to an embedding layer, followed by three consecutive 1D convolutional layers that learn representations from the raw protein sequence data. The CNN captures local dependencies by sliding filters over the input features, and its output is the protein sub-sequence embedding matrix, S ∈ ℝ^(Ns×d), where Ns is the number of sub-sequences. The number of amino acids in a sub-sequence depends on the filter size: the larger the filter, the more amino acids in each sub-sequence.
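A minimal PyTorch sketch of this encoder (embedding followed by three 1D convolutions) is given below; the vocabulary size, channel widths, and kernel size are illustrative placeholders rather than the tuned values in Table 5.

```python
import torch
import torch.nn as nn

class ProteinCNN(nn.Module):
    """Sketch of the protein encoder: embedding + three 1D convolutions."""
    def __init__(self, vocab_size=26, emb_dim=128, d=64, kernel_size=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)  # 0 = zero padding
        self.convs = nn.Sequential(
            nn.Conv1d(emb_dim, d, kernel_size), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size), nn.ReLU(),
        )

    def forward(self, seq_ids):
        # seq_ids: (batch, 1000) integer-encoded amino acids, zero-padded
        x = self.embed(seq_ids).transpose(1, 2)   # (batch, emb_dim, 1000)
        s = self.convs(x)                          # (batch, d, Ns)
        return s.transpose(1, 2)                   # (batch, Ns, d): sub-sequence embeddings S

encoder = ProteinCNN()
S = encoder(torch.randint(0, 26, (2, 1000)))
print(S.shape)  # torch.Size([2, 982, 64]) -- Ns depends on the filter (kernel) size
```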

2.5. Interaction Learning Model

The relationship between the protein and the compound is a key determinant for DTA prediction. An attention mechanism lets the two inputs influence the computation of each other's representation, so the input pair can jointly learn their relationship. GraphATT-DTA constructs the relation matrix R ∈ ℝ^(Na×Ns) as the dot product of the compound and protein embedding matrices (R = DSᵀ). It provides information about the relationships between the substructures of compounds and protein sub-sequences, so GraphATT-DTA reflects local interactions by considering the crucial relationships between protein sub-sequences and compound substructures. A sub-sequence-wise/atom-wise softmax is applied to the relation matrix to construct the substructure and sub-sequence significance matrices, given in Eqs. (5) and (6). The element a_ij of substructure_significance indicates the importance of substructure i to sub-sequence j; similarly, the element s_ij of subsequence_significance indicates the importance of sub-sequence j to substructure i.

substructure_significance = a_ij = exp(r_ij) / Σ_{i=1}^{Na} exp(r_ij)

(5)

subsequence_significance = s_ij = exp(r_ij) / Σ_{j=1}^{Ns} exp(r_ij)

(6)

The substructure_significance is applied to the drug embedding matrix via element-wise multiplication (⊙) of a_j and D, where a_j ∈ ℝ^(Na×1) is the j-th column of the significance matrix, j = 1, …, Ns; a_j indicates the importance of each substructure with respect to the j-th sub-sequence. The resulting matrix D′_(j) = a_j ⊙ D ∈ ℝ^(Na×d) (Eq. (7)) is the drug embedding matrix weighted by the importance with respect to the j-th sub-sequence. The drug vector d″_(j) ∈ ℝ^(1×d) is constructed by Eq. (8) and carries the information of the compound together with the j-th sub-sequence.

d″_(j)b = Σ_{a=1}^{Na} D′_(j)ab

(8)

D″ = concat[d″_(1), d″_(2), …, d″_(Ns)]

(9)

Concatenating d″_(j) over all sub-sequences yields D″ ∈ ℝ^(Ns×d), which inherits information about all the sub-sequences and the compound. The new drug feature, drug_feature ∈ ℝ^(1×d), is then constructed to reflect all protein sub-sequences and compound atoms:

drug_feature_j = Σ_{i=1}^{Ns} D″_ij

(10)

The new protein feature is calculated in the same way. Element-wise multiplication of the subsequence_significance and the protein embedding matrix yields P′_(i) = s_i ⊙ S ∈ ℝ^(Ns×d) (Eq. (11)), the protein embedding matrix weighted by the sub-sequence significance with respect to the i-th substructure, where s_i ∈ ℝ^(1×Ns) and S ∈ ℝ^(Ns×d). The summation over P′_(i) gives the protein vector p″_(i) ∈ ℝ^(1×d), which carries the sub-sequence information with respect to the compound. After concatenating the p″_(i) into P″ ∈ ℝ^(Na×d), the summation over P″ gives the new protein feature vector, protein_feature ∈ ℝ^(1×d), which reflects the compound substructure significance information:

p″_(i)b = Σ_{a=1}^{Ns} P′_(i)ab

(12)

P″ = concat[p″_(1), p″_(2), …, p″_(Na)]

(13)

protein_feature_j = Σ_{i=1}^{Na} P″_ij

(14)

The protein and drug features, which reflect the local-to-global interaction information, are collected via concatenation, and fully connected layers then predict the binding affinity. We use the mean squared error (MSE) as the loss function.
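The following is a minimal PyTorch sketch of the interaction module described by Eqs. (5)-(14) for a single drug-protein pair; the prediction-head sizes are illustrative, and the per-sub-sequence/per-substructure weighted sums are written as matrix products equivalent to the element-wise formulation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionModule(nn.Module):
    """Sketch of local-to-global attention for one drug-protein pair (cf. Eqs. (5)-(14))."""
    def __init__(self, d, hidden=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, D, S):
        # D: (Na, d) drug substructure embeddings; S: (Ns, d) protein sub-sequence embeddings
        R = D @ S.t()                  # relation matrix, (Na, Ns)
        A = F.softmax(R, dim=0)        # substructure significance, Eq. (5): softmax over atoms
        Ssig = F.softmax(R, dim=1)     # sub-sequence significance, Eq. (6): softmax over sub-sequences

        # Drug branch: for each sub-sequence j, weight atoms by A[:, j] and sum (Eqs. (7)-(9));
        # summing the resulting rows gives the drug feature (Eq. (10)).
        D2 = A.t() @ D                 # (Ns, d), row j equals sum_a A[a, j] * D[a, :]
        drug_feature = D2.sum(dim=0)   # (d,)

        # Protein branch: for each substructure i, weight sub-sequences by Ssig[i, :] (Eqs. (11)-(13)),
        # then sum over substructures to get the protein feature (Eq. (14)).
        P2 = Ssig @ S                  # (Na, d), row i equals sum_s Ssig[i, s] * S[s, :]
        protein_feature = P2.sum(dim=0)  # (d,)

        pair = torch.cat([drug_feature, protein_feature], dim=-1)
        return self.head(pair)         # predicted binding affinity

Na, Ns, d = 24, 982, 64
affinity = InteractionModule(d)(torch.randn(Na, d), torch.randn(Ns, d))
print(affinity.shape)  # torch.Size([1])
```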

2.6. Implementation and Hyperparameter Settings

GraphATT-DTA was implemented with PyTorch 1.5.0 [35], and the GNN models were built with DGL v.0.4.3(2) [27] and DGL-LifeSci v.0.2.4 [28]. Early stopping with a patience of 30 epochs was used to avoid overfitting and improve generalization performance. The hyperparameter settings are summarized in Table 5; hyperparameters were selected with five-fold cross-validation over multiple experiments. The number of GNN layers is important because it determines how many hops of neighboring nodes the model considers. With more layers, the model can aggregate information from more distant neighbors; however, too many layers can cause an over-smoothing problem in which all node embeddings converge to the same value, while too few layers fail to capture the graph substructure. Therefore, a proper layer configuration is important. The optimal number of GNN layers for GraphATT-DTA was chosen experimentally for each GNN graph embedding model. Detailed experimental results can be found in Supplementary Table S2 and Supplementary Figure S1.
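For reference, a minimal sketch of an early-stopping loop with a patience of 30 epochs is shown below; the model, optimizer, and data loaders are placeholders, and this is not the authors' exact training code.

```python
import copy
import torch

def train_with_early_stopping(model, optimizer, loss_fn, train_loader, valid_loader,
                              max_epochs=1000, patience=30):
    """Early stopping on validation MSE with a patience of 30 epochs (illustrative)."""
    best_mse, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for batch, target in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch), target)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            valid_mse = sum(loss_fn(model(batch), target).item()
                            for batch, target in valid_loader) / len(valid_loader)

        if valid_mse < best_mse:
            best_mse = valid_mse
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop when validation MSE has not improved for `patience` epochs

    model.load_state_dict(best_state)
    return model, best_mse
```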
