Research on Efficient and Low-cost Drug-disease Association Prediction Method Based on Dual Attention in Heterogeneous Networks

Drug development is usually expensive, so it is important to establish an efficient, low-cost, and accurate method for predicting drug-disease associations. In this paper, a drug-disease association prediction method based on dual attention in heterogeneous networks is proposed. First, the experimental data set is constructed from biological databases; then the node feature information in the heterogeneous network is extracted by a graph attention network, and this feature information is filtered and enhanced by SENet. Finally, evaluated by 10-fold cross-validation, GASEDDA achieved an accuracy of 98.5%.


Introduction
Drug research and development can be summarized in three stages: drug discovery, preclinical research, and clinical research. The process is difficult: developing a new drug usually takes 10-15 years and an investment of 1.5 billion US dollars, and the overall process is time-consuming and risky. In the United States, more than 100 drugs are screened by the Food and Drug Administration (FDA) every year before approval for market, and only about 20% eventually reach the market [1]. The number of new drugs approved around the world is declining year by year, and the failure rate of new drug approval has been higher than 90% since the 1990s [2]. To address these problems, researchers try to develop new drugs through drug repositioning [3].
Drug repositioning is the process of determining potential new indications of existing drugs and thereby discovering new drug treatments for diseases. It is considered one of the best risk-benefit strategies among existing drug research and development strategies [4] and has attracted close attention worldwide. Compared with traditional drug development, drug repositioning narrows the screening scope of drug development and saves considerable money and time. Traditional drug repositioning methods include analysis of the common biochemical characteristics of drugs [5], drug prescription screening [6], and molecular activity similarity analysis [7]. With the continued development and use of biological databases such as DrugBank [8], PubChem [9], and SIDER [10], computational drug repositioning has gained broad prospects and has received increasing attention from researchers.
At present, researchers focus on identifying new drug targets by using the chemical structure, pharmacology, and genomic properties of drugs. Some scholars have proposed mining the potential indications of marketed or unmarketed drugs by directly predicting drug-disease associations. Existing drug repositioning methods are mainly divided into recommendation system-based, machine learning-based, deep learning-based, and network-based methods. Recommendation system-based methods mainly use matrix factorization, but because of the cold-start problem they are not suitable for predicting new drugs or new diseases. Machine learning-based methods are widely used, but they rely heavily on input data that reflect the characteristics of drugs and diseases, a requirement that is difficult to meet in practice. Deep learning-based methods use their strong learning ability to transform raw data features into abstract feature representations, which addresses the incompleteness of manually selected features. However, they need a large amount of training data to reach high precision; when the input drug-disease association network is too sparse, deep learning-based methods are prone to over-fitting. Network-based methods capture similarity information from different types of biological networks as features of drugs and diseases. These methods usually introduce heterogeneous networks to represent different types of biological information and retain the similarities across the different biological networks, so as to infer unobserved associations between drugs and diseases.
The attention mechanism is widely used in deep learning tasks such as natural language processing, image recognition, and speech recognition, and has become one of the core technologies of deep learning. When processing information received from the outside world, the human brain focuses on key information of high value and interest, and the attention mechanism is inspired by this way of processing information. It can be regarded as a combinatorial function that highlights the influence of key inputs on the output by computing a probability distribution of attention. The attention mechanism is also widely used in bioinformatics, for example to predict drug-disease associations with a layer attention mechanism or to integrate multiple biological relationships for drug-disease association prediction.
In this paper, a graph attention heterogeneous network model based on SENet is proposed to predict drug-disease associations. Information on drugs, diseases, and genes is collected from biological databases, and a benchmark data set is constructed. From the known drug-disease, drug-gene, and disease-gene associations, together with the computed drug, disease, and gene similarities, we construct a heterogeneous network for predicting drug-disease associations. Based on this heterogeneous network, we extract feature information from the similarity networks through the graph attention mechanism and then recalibrate the channel features. Finally, an integrated embedding prediction module is used to predict unobserved drug-disease associations. In computer simulation experiments, the proposed method achieves an AUC of 90.2% and an ACC of 98.5%, and it performs better than the other existing methods it is compared with.

Dataset
To evaluate the proposed model, we collect data from biological databases such as CTD [11] and DrugBank [8] and construct a benchmark data set that includes 709 drugs, 5604 diseases, and 1513 proteins.
It contains 199214 edges of three types: drug-disease, drug-protein, and disease-protein.

Disease-disease similarity
The Medical Subject Headings (MeSH) identifier of a disease can be described as a hierarchical directed acyclic graph (DAG).
In this paper, the DAG structure is used to calculate the semantic similarity of diseases.
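For reference, the semantic contribution (Eq. 1) and semantic value (Eq. 2) used in DAG-based disease similarity are commonly written as below. This is a reconstruction from the surrounding definitions rather than a verbatim copy of the original equations; the symbols $t$ (a node in the DAG of $d$), $N(d)$ (the set containing $d$ and all of its ancestors), and $D_d(t)$ are assumed names.

```latex
D_d(t) =
\begin{cases}
1, & t = d \\
\max\{\Delta \cdot D_d(t') \mid t' \in \text{children of } t\}, & t \neq d
\end{cases}
\tag{1}

DV(d) = \sum_{t \in N(d)} D_d(t)
\tag{2}
```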
Here, $\Delta$ is the semantic attenuation factor, which we set to 0.5 following previous research, and the semantic contribution of disease $d$ to itself has a value of 1. From Eq. 1, the contribution of a disease $t$ is determined by the distance between disease $d$ and disease $t$; by summing the contributions of $d$ and all of its ancestor nodes, we use Eq. 2 to obtain the semantic value of $d$.
Combining Eq. 1 and Eq. 2, we obtain the semantic similarity between disease $d_i$ and disease $d_j$ (Eq. 3), where the semantic values of $d_i$ and $d_j$ are denoted $DV(d_i)$ and $DV(d_j)$, respectively.
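The computation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the commonly used form of the similarity (the summed contributions of the shared ancestors divided by the sum of the two semantic values); the function names and the toy DAG `TOY_PARENTS` are hypothetical and are not taken from the paper's data.

```python
DELTA = 0.5  # semantic attenuation factor, as set in the text


def semantic_contributions(disease, parents, delta=DELTA):
    """Return {t: contribution of t to `disease`} for the disease and its ancestors.

    `parents` maps each term to the list of its direct parents in the DAG.
    """
    contrib = {disease: 1.0}  # a disease contributes 1 to itself
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            candidate = delta * contrib[node]
            # Keep the maximum over all children, re-expanding when it improves.
            if candidate > contrib.get(parent, 0.0):
                contrib[parent] = candidate
                frontier.append(parent)
    return contrib


def semantic_value(contrib):
    """Semantic value of a disease: sum of the contributions of its DAG nodes."""
    return sum(contrib.values())


def disease_similarity(d1, d2, parents, delta=DELTA):
    """Semantic similarity of two diseases from their shared DAG nodes."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (semantic_value(c1) + semantic_value(c2))


# Hypothetical toy hierarchy: A and B are both children of C, which is a child of root.
TOY_PARENTS = {"A": ["C"], "B": ["C"], "C": ["root"], "root": []}
sim_ab = disease_similarity("A", "B", TOY_PARENTS)
```

In the toy hierarchy, each of A and B has a semantic value of 1 + 0.5 + 0.25 = 1.75, and the two diseases share the nodes C and root.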

Drug-drug similarity
Drugs are special chemicals used to prevent, treat, or diagnose diseases, or to regulate the function of the human body, improve quality of life, and maintain good health. They usually have distinct biological and chemical properties. We can convert drugs into many types of feature vectors according to these properties and calculate drug similarity from the features. In this paper, we download the drug SMILES sequences from DrugBank [8], convert them into topological fingerprints, and calculate the similarity between two drugs from the fingerprint bits using Tanimoto similarity. For drug $dr_i$ and drug $dr_j$, the similarity between them can be calculated by Equation 4 and Equation 5: $S(dr_i, dr_j) = 1 - \frac{\min(\cdot)}{\max(\cdot) - \min(\cdot)}$ (5)
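The Tanimoto similarity between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below illustrates only that step, under the assumption that each fingerprint is given as a collection of "on" bit positions; the helper names and the example fingerprints are hypothetical, and generating real fingerprints from SMILES would require a cheminformatics toolkit that is not shown here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two collections of 'on' fingerprint bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a & b) / union


def pairwise_similarity(fingerprints):
    """Symmetric drug-drug similarity matrix from a list of fingerprints."""
    n = len(fingerprints)
    return [[tanimoto(fingerprints[i], fingerprints[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical fingerprints for three drugs (positions of the bits that are set).
example_fps = [{1, 4, 7, 9}, {1, 4, 8}, {2, 3}]
example_matrix = pairwise_similarity(example_fps)
```

The first two example drugs share two of five distinct bits, so their similarity is 0.4, while the third shares none with either.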

Gene-gene functional similarity
Calculating gene-gene functional similarity is a basic task in bioinformatics and an important part of life science research. In this paper, we use the Gene Ontology (GO) to study the similarity between genes.
The GO uses a directed acyclic graph to represent structured relationships between biological terms, as shown in Figure 1. A node in the graph represents a term; except for the root node, each node may have multiple parent nodes, and each node may have multiple children. The depth of a node is the length of the shortest path between that node and the root node. The closer a node is to the root node, the more general its term semantics; conversely, the further a node is from the root node, the more specific its term semantics. According to previous research, the deeper the Lowest Common Ancestor (LCA) of two nodes, the more similar the two nodes are.
In the similarity formula, $N_1$ and $N_2$ denote the lengths of the shortest paths from the two nodes to their LCA, respectively, and $H$ is the length of the shortest path between the LCA and the root node.
Figure 1: Relationships between some of the nodes in the biological process subgraph of the GO
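The exact similarity formula is not reproduced in this text. One measure that matches the quantities described, two path lengths to the LCA and the depth of the LCA, is the Wu-Palmer form 2H / (N1 + N2 + 2H). The sketch below assumes that form; the function names and the toy term hierarchy `TOY_GO` are hypothetical.

```python
from collections import deque


def upward_distances(term, parents):
    """Shortest path length from `term` to itself and each of its ancestors."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in dist:  # breadth-first, so first visit is shortest
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def lca_similarity(t1, t2, parents, root):
    """Wu-Palmer style similarity (assumed form): 2H / (N1 + N2 + 2H)."""
    d1 = upward_distances(t1, parents)
    d2 = upward_distances(t2, parents)
    common = set(d1) & set(d2)
    # Depth of a term = shortest path length from the term to the root.
    depth = {t: upward_distances(t, parents)[root] for t in common}
    # Deepest common ancestor; ties are broken by the shorter combined path.
    lca = max(common, key=lambda t: (depth[t], -(d1[t] + d2[t])))
    h, n1, n2 = depth[lca], d1[lca], d2[lca]
    if n1 + n2 + 2 * h == 0:
        return 1.0  # both terms are the root itself
    return 2 * h / (n1 + n2 + 2 * h)


# Hypothetical toy hierarchy (child -> list of parents).
TOY_GO = {
    "root": [],
    "bp": ["root"],
    "metabolic": ["bp"],
    "signaling": ["bp"],
    "lipid": ["metabolic"],
    "protein": ["metabolic"],
}
sim_siblings = lca_similarity("lipid", "protein", TOY_GO, "root")
sim_distant = lca_similarity("lipid", "signaling", TOY_GO, "root")
```

In the toy hierarchy, the two sibling terms have a deeper LCA than the distant pair and therefore receive the higher similarity, consistent with the observation above.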

Heterogeneous network construction
In addition to the drug-drug, disease-disease, and gene-gene similarities, we obtain drug-disease associations from DrugBank [8], and drug-gene and disease-gene associations from CTD [11]. To better represent the heterogeneous network, we describe it in a parameterized form. Taking the drug-disease association as an example, it is represented as a binary matrix $A \in \{0,1\}^{N \times M}$, where $N$ is the number of drugs and $M$ is the number of diseases. If drug $dr_i$ is associated with disease $di_j$, then $A_{ij} = 1$; otherwise $A_{ij} = 0$. Finally, the heterogeneous network model is constructed as shown in Equation 7.
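One plausible way to assemble such a network is as a symmetric block matrix, with the three similarity matrices on the diagonal and the association matrices (and their transposes) off the diagonal. The sketch below assumes that layout, which may differ from the paper's Equation 7; all function names and toy matrices are hypothetical, and plain Python lists are used only to keep the example self-contained.

```python
def transpose(matrix):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*matrix)]


def hstack(*blocks):
    """Concatenate blocks with equal row counts side by side."""
    return [sum((list(block[i]) for block in blocks), [])
            for i in range(len(blocks[0]))]


def build_heterogeneous_adjacency(s_dr, s_di, s_g, a_dr_di, a_dr_g, a_di_g):
    """Assumed block layout, with node order drugs, then diseases, then genes."""
    top = hstack(s_dr, a_dr_di, a_dr_g)
    middle = hstack(transpose(a_dr_di), s_di, a_di_g)
    bottom = hstack(transpose(a_dr_g), transpose(a_di_g), s_g)
    return top + middle + bottom


# Hypothetical toy network: 2 drugs, 2 diseases, 1 gene.
toy_network = build_heterogeneous_adjacency(
    s_dr=[[1.0, 0.3], [0.3, 1.0]],     # drug-drug similarity
    s_di=[[1.0, 0.5], [0.5, 1.0]],     # disease-disease similarity
    s_g=[[1.0]],                       # gene-gene similarity
    a_dr_di=[[1, 0], [0, 1]],          # drug-disease associations
    a_dr_g=[[1], [0]],                 # drug-gene associations
    a_di_g=[[0], [1]],                 # disease-gene associations
)
```

With this layout, the resulting matrix is symmetric whenever the three similarity matrices are symmetric.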

Models and Methods
In this section, we formally introduce the SEGAT method for drug-disease association prediction. It consists of GAT [12] node feature extraction, SENet feature aggregation and enhancement, and drug-disease association prediction. The workflow of SEGAT is shown in Figure 2.

Graph attention network
The Graph Attention Network (GAT) was proposed by Veličković et al. [12] in 2017; its main idea is to apply the attention mechanism to graph structures. The core of GAT is the graph attention layer, which takes as input a set of node features $h = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^{F}$, where $N$ denotes the number of nodes and $F$ denotes the node feature dimension, and outputs new node feature representations of dimension $F'$. After obtaining the new feature representation, a shared attention mechanism $a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}$ is applied to obtain the attention coefficients $e_{ij}$. The attention coefficient $e_{ij}$ indicates the importance of node $j$'s features for node $i$. To make the attention coefficients comparable across different nodes, they are normalized using softmax.
After obtaining the normalized attention coefficients, the features of each node are updated from the representations of its neighboring nodes: the neighbors' features are combined as a weighted sum, with the normalized attention coefficients as weights, and passed through a nonlinear activation function σ.
This gives the features output by the GAT layer.
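The steps of a single-head graph attention layer can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the attention function is taken to be a LeakyReLU over a learned vector applied to the concatenated transformed features, as in the original GAT formulation, the activation σ is taken to be ELU, and the weights in the example are arbitrary numbers rather than trained parameters.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def gat_layer(features, neighbors, w, a, activation=elu):
    """Single-head graph attention layer.

    features:  N input vectors of length F
    neighbors: neighbors[i] lists the nodes attended to by node i
               (include i itself to model a self-loop)
    w:         F' x F shared weight matrix
    a:         attention vector of length 2 * F'
    Returns the new features and the attention weights per node.
    """
    z = [matvec(w, h) for h in features]  # shared linear transform
    outputs, attention = [], []
    for i, z_i in enumerate(z):
        nbrs = neighbors[i]
        # Unnormalized coefficients e_ij from the concatenation [z_i || z_j].
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, z_i + z[j])))
                  for j in nbrs]
        # Softmax over the neighborhood (shifted by the max for stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # Attention-weighted sum of neighbor features, then the nonlinearity.
        agg = [sum(al * z[j][k] for al, j in zip(alpha, nbrs))
               for k in range(len(z_i))]
        outputs.append([activation(v) for v in agg])
        attention.append(dict(zip(nbrs, alpha)))
    return outputs, attention


# Hypothetical 3-node graph with 2-dimensional features and arbitrary weights.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_neighbors = [[0, 1], [0, 1, 2], [1, 2]]
toy_out, toy_attn = gat_layer(toy_features, toy_neighbors,
                              w=[[0.5, -0.5], [1.0, 0.25]],
                              a=[0.1, -0.2, 0.3, 0.4])
```

Because of the softmax, the attention weights of each node sum to 1 over its neighborhood, which is what makes coefficients comparable across nodes with different numbers of neighbors.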

Feature Enhancement
In this module, we aim to strengthen the model's focus on important features. We capture the contribution of each channel's features to the original signal through self-learning, and then enhance important features and suppress unimportant or even useless features according to the degree of each channel's contribution. This is known as feature recalibration. The obtained $\vec{h}_i'$ is first subjected to the Squeeze operation, which compresses the features along the spatial dimension.
After the Squeeze operation, we learn the importance of the channels through the Excitation operation, which yields a weight for each channel. Finally, the SENet enhancement of the original features is accomplished through the Scale operation, which weights each channel's features by the corresponding Excitation output, treated as that channel's importance.
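The Squeeze, Excitation, and Scale operations can be sketched as below. How SENet is mapped onto node embeddings is not specified in this text, so the sketch assumes that each feature dimension is a channel and that Squeeze averages over the node axis; Excitation follows the usual two-layer bottleneck (ReLU, then sigmoid). All names and weight values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def squeeze(h):
    """Average each channel over all nodes (assumed 'spatial' axis)."""
    n = len(h)
    return [sum(row[c] for row in h) / n for c in range(len(h[0]))]


def excitation(z, w1, w2):
    """Two-layer bottleneck: ReLU(w1 z), then sigmoid(w2 hidden)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    return [sigmoid(sum(w * v for w, v in zip(row, hidden))) for row in w2]


def scale(h, s):
    """Reweight every node's features channel by channel."""
    return [[v * weight for v, weight in zip(row, s)] for row in h]


def se_block(h, w1, w2):
    """Return the recalibrated features and the channel weights."""
    s = excitation(squeeze(h), w1, w2)
    return scale(h, s), s


# Hypothetical input: 2 nodes with 4 channels, bottleneck of size 2.
toy_h = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]
toy_w1 = [[0.1, 0.2, -0.1, 0.0], [0.0, -0.3, 0.2, 0.1]]              # 2 x 4
toy_w2 = [[0.5, -0.5], [0.2, 0.1], [-0.4, 0.3], [0.0, 0.6]]          # 4 x 2
toy_scaled, toy_weights = se_block(toy_h, toy_w1, toy_w2)
```

The sigmoid keeps every channel weight strictly between 0 and 1, so Scale can attenuate a channel but never flips its sign.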

Prediction
After graph attention and SENet, we finally obtain the drug and disease embedding vectors. In the experiments of this paper, we use a bilinear inner product decoder to construct the association matrix between drugs and diseases.
In the decoder, $W' \in \mathbb{R}^{k \times k}$ is a trainable weight matrix, with $k$ the embedding dimension, and the association prediction score between a drug and a disease is given by the corresponding entry $(i, j)$. Each element $A'_{ij}$ denotes the association score between drug $i$ and disease $j$.
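A bilinear inner product decoder scores a drug-disease pair by multiplying the drug embedding, a weight matrix, and the disease embedding. The sketch below assumes a sigmoid is applied to map the score into (0, 1), which this text does not state; the function names, embeddings, and weights are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bilinear_scores(drug_emb, disease_emb, w):
    """Score matrix with entry (i, j) = sigmoid(drug_i^T  w  disease_j).

    drug_emb:    list of drug embedding vectors of length k
    disease_emb: list of disease embedding vectors of length k
    w:           k x k weight matrix (trainable in a real model)
    """
    scores = []
    for h_drug in drug_emb:
        # Project the drug embedding through w once, then reuse it per disease.
        projected = [sum(h_drug[p] * w[p][q] for p in range(len(h_drug)))
                     for q in range(len(w[0]))]
        scores.append([sigmoid(sum(a * b for a, b in zip(projected, h_dis)))
                       for h_dis in disease_emb])
    return scores


# Hypothetical embeddings: 2 drugs, 3 diseases, identity weights.
toy_scores = bilinear_scores(
    drug_emb=[[1.0, 0.0], [0.0, 1.0]],
    disease_emb=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    w=[[1.0, 0.0], [0.0, 1.0]],
)
```

With identity weights the decoder reduces to a plain inner product, so orthogonal embeddings score exactly 0.5 and aligned embeddings score higher.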

Evaluation Metrics
To evaluate the model objectively, we use 10-fold cross-validation to reduce errors caused by the particular data split, and we use several indicators, including AUC, AUPR, F1 score, and recall, to evaluate the performance of the model as comprehensively as possible. In the corresponding formulas, TP is the number of true positives, i.e., correctly predicted drug-disease relationships; FP is the number of false positives, i.e., predicted relationships that are not labeled; FN is the number of false negatives, i.e., labeled relationships that the model failed to predict; and TN is the number of true negatives, i.e., correctly predicted unlabeled drug-disease relationships.
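The threshold-based metrics follow directly from the four counts, and 10-fold cross-validation amounts to partitioning the samples into ten disjoint test folds. The sketch below uses the standard definitions of precision, recall, F1, and accuracy; the function names, the fixed shuffle seed, and the example counts are hypothetical, and AUC and AUPR are omitted because they require ranked scores rather than counts.

```python
import random


def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Hypothetical counts for one fold.
example_metrics = classification_metrics(tp=8, fp=2, fn=2, tn=88)
example_splits = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one test fold, so every known association is held out once across the ten rounds.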

Comparison with other algorithms
We compare our method with other drug-disease association prediction algorithms to demonstrate its effectiveness. We conduct a comparative experiment between our method and the methods of Kang [13] and Chen [14] on the same dataset. As shown in Table 1, our method achieves an accuracy of 98.5%, higher than the other methods, indicating that it predicts more accurately whether a drug can treat a disease.

Case Study
To validate the model's ability to discover new drug-disease associations, we train it on known drug-disease associations obtained from biological databases such as DrugBank and CTD and then use it to predict new associations. We validated the predictions against approved clinical trial studies and public literature (Table 2). For example, Etomidate [15], a white powdery substance insoluble in water, is one of the commonly used drugs for induction of anesthesia and has been in clinical use for 30 years; it has a rapid but short-lived action, fast sleep onset and awakening, and strong depressant effects on the central nervous system. Phenytoin [16] is mainly used as an antiepileptic and antiarrhythmic, acting by highly selective inhibition of the motor area of the cerebral cortex. It is considered the drug of choice for the treatment of grand mal and partial seizures. Literature [23]. In addition, to further test the validity of our model, we examined the top five disease candidates for Sulfasalazine, a drug used to treat inflammatory bowel disease, and the top five drug candidates for HIV infections. Tables 3 and 4 show the results, which can be confirmed by public literature and clinical medical studies. From Table 3 and Table 4, we can see that the present model can help to identify new drug-disease associations.

Conclusions
In this paper, we developed the GASEDDA model to discover drug-disease associations. Drug-disease associations were predicted by combining the graph attention mechanism and the SENet attention mechanism over a heterogeneous network built from drug-drug, disease-disease, and gene-gene similarities. The results show that the model outperforms the other drug-disease association prediction methods it was compared with.
In future research, considering that biological networks are large and highly interconnected, we will consider adding more biological attribute networks, such as proteins and drug targets. In addition, although GAT is a powerful graph neural network method that can effectively extract node information from the network, it loses structural information; we hope to address this problem in the future with methods such as graph embedding.
For a disease $d$, DAG$(d) = (N(d), E(d))$, where $N(d)$ denotes the node set containing $d$ and all of its ancestors, and $E(d)$ denotes the set of all relationships between the diseases in $N(d)$. The semantic contribution of a disease $t \in N(d)$ to $d$ is given by Eq. 1.

Table 1: Comparison of method performance

Table 2: Top ten drug-disease associations predicted in this experiment

Table 3: Top 5 Disease Candidates for Sulfasalazine

Table 4: Top 5 Drug Candidates for HIV Infections