Research on Android Malware Detection Based on Bayesian Network

Objective As the Android apps spring up at present, combined with the open source nature of Android system, malicious codes are easily embedded in Android apps, leading to a serious threat to users. However, most detection methods based on app permission features have neglected the correlation among permissions, resulting in a poor practicability and a high false alarm rate. Therefore, a malware detection method based on Bayesian network was put forward in this paper. Method The permission data of a range of Android apps were analyzed to determine the Bayesian network structure and parameter distribution on basis of expert knowledge. The open source Android app data set was introduced to verify the model, and multiple detection algorithms integrated with multiple indicators were adopted for comparison, so as to discover the maximum possible features of the malware based on the network structure. Results According to the analysis on accuracy, precision, recall, and F1 value, the indicators of this method are higher than those of logistic regression and random forest methods. The location where the malicious code is most possible embedded could be reasoned backwards by this method. Conclusion The method is accurate and feasible to locate the permission to generate malicious behavior finally in case of known malwares, providing a basis for locating the malicious code.


Introduction
With the progress and rapid development of technology and network in recent years, mobile intelligent devices have been comprehensively popularized and widely used in various industries in the society. Mobile apps have changed people's lives to a great extent and have been closely bound people since its emergence. Most of them may involve the user's personal information, payment information, positioning information, etc. Therefore, the security of mobile apps has attracted broad attention. As of the third quarter of 2019, benefitting from the open source service of Android, its operating system accounted for much more than the iOS operating system in the mobile operating system of the CNCERT/CC (National Computer Network Emergency Response Technical Team/Coordination Center of China) [1]. Attributing to the openness of the Android ecosystem, users may download any desired apps from any third-party platform, and developers may upload the developed mobile apps to any third-party platform for users to download, causing malwares to be very active on the Android platform. According to the 2019 Mobile Security Status Report released by the 360 Internet Security Center in February 2020, about 1.809 million new malware samples were intercepted on the mobile terminal, and the malicious behaviors of which are designed with diverse types, including privacy theft, remote control, and rogue behaviors, malicious deductions, and fraud software[2], posing a serious threat to the security when users use mobile phones. Major mobile phone security platforms have put their emphasis on the research of security detection of mobile apps, which is also a hotspot in the field of information security.
Existing methods to detect malwares have certain effects, but malwares continue to enhance their capacities to camouflage and deceive the detection system and evade detection. The functions of malwares must also be in necessary connection with permissions. Permissions are the best embodiment of Android applications with specific functions, which is enforced by Android in terms of security design. Therefore, an array of experts and scholars generally ignored the correlation between permissions and reduce the detection accuracy when researching app permissions, which increased the false alarm rate to a certain extent. How to establish a network model of interrelated features based on existing Android application permission feature information to achieve efficient and reliable identification of malicious applications is a problem that needs to be solved urgently in Android application detection.
In order to solve this problem, the permission feature was proposed as a node, the influence relationship as a path, and the influence degree as a path parameter in this paper to establish a Bayesian network structure and solve the problem that the structure learning algorithm occupied a large amount of memory. A priori probability and conditional probability distribution were obtained based on the statistics of historical data, a complete Bayesian network was established, and then the Bayesian network inference algorithm was introduced to detect the security status of the Android app, and the permission features of the malware were reasoned backwards, thereby providing a basis for quickly locating malicious code in the future. According to the results, the Bayesian network established with this method can effectively detect the security situation of Android apps, analyze the specific permissions that affect the security situation, and acquire a good detection efficiency.

Related Research
The malwares can be roughly detected with two categories of methods: signature-based detection technology, and behavior-based detection technology [3]. In view of signature-based detection technology [3], QIN Zhongyuan, et al. [4] proposed to generate signatures from API, Method, Class, and APK on basis of smali files, and extract common signatures from the knowledge base according to different categories and match them to judge the category of malwares and define the malicious code retrospectively. This method effectively solved the problem of locating malicious codes. NING Zhuo, et al. [5] discovered the data flow patterns frequently operated by malwares automatically by resorting to data type analysis technology, and enhanced the ability to detect malwares with code obfuscation in combination with the improved multi-level signature detection algorithm. The signature-based detection technology has been applied in major security detection platforms. However, it can only detect malwares falling under the signature library, ineffective to detect malwares beyond the scope of signature library.
The behavior-based detection technology is mainly achieved through static detection and dynamic detection [6][7]. Bhatia, et al. [8] automatically and randomly triggered the non-pass function of the app via Android Monkey, and called the strace software to monitor the API called by the running app, comprehensively output the statistical data of api calling, and distinguished the malwares on basis of machine learning algorithm. This method effectively realized the ability to automatically identify malicious applications and improved the execution efficiency, but it had a low accuracy because Android Monkey cannot effectively trigger all functions of the app. Feng, et al. [9] proposed an optimization algorithm for multi-feature selection to identify the feature of the collected data and eliminate irrelevant ones through the feature selection algorithm, and finally distinguished the malwares through the integrated learning algorithm and improve the distinguishing accuracy of the system, but it takes a lot of time in resource processing.
Static detection technology is to detect and interpret the decompiled code and determine whether it is a malware, enjoying a merit of high code coverage, which can detect the malwares before the app is executed, but it may be ineffective under the circumstance of code obfuscation and encryption [10][11] [12]13]. Mariconti, et al. [14] adopted Package and Family mode to abstract the API call sequence with API as the main research object of the static detection method, and distinguished the malwares in combination with the classification algorithm. However, the classification feature dimension obtained by this method was high, because the effect of dimensionality reduction via PCA was not ideal, and a lot of resources were consumed for classification. According to McLaughlin, et al. [15], it was simpler to use CNN than the feature sequence generated by N-gram by training the improved Dalvik opcode sequence in CNN convolutional neural network, and learning feature sequences of malwares, which also acquired a higher accuracy.

Bayesian Network
In the 1980s, Judea Pearl first put forward the Bayesian Network (BN), a probabilistic graphical model, which was an extension to Bayesian method [16]. At present, Bayesian network is one of the most effective theoretical models in uncertain knowledge expression, result prediction, cause deduction and probability calculation. Bayesian network is designed in a directed acyclic graph (DAG) consisting of nodes and directed edges. Nodes are the representation of the features, expressed with a random variable. DAG explains the dependency relationship among features. A directed edge represents the relationship among features. Each node has a conditional probability function to represent the probability distribution of the node under the condition that the parent node occurs, called as a conditional probability table (CPT), which describes the degree of dependence among features by a quantitative way.
In view of mathematical description in Bayesian networks, , where G is the network structure, indicating the dependency relationship structure among nodes, is the network parameters, describing the degree of dependence among features. Assuming the parent node of the node , there are multiple parent nodes of the node in the network, and the set of parent nodes of the node in G is represented by .The probability of the node under the condition of the parent node is: It can be deduced that the CPT of the node in the parent node set is: The joint probability distribution of the entire Bayesian network is: The above equation is denoted as Equation 1.

Learning of Bayesian network
Learning of Bayesian network mainly covers network structure learning and parameter distribution learning. Network structure learning is to analyze the training data set, identify the features, determine the nodes, find out the correlation among the features, acquire the dependency relationships among the nodes, and finally draw the network structure to lay a foundation for parameter learning. Research scholars have put forward different solutions to the current problem of structural learning, which can be summarized into two categories. (I) Expert experience: The dependency relationship between features is judged based on the experience knowledge of field experts, so as to establish a Bayesian network structure, its advantage lies in clear structure, maximized logical relationship, and consistent field business, and its disadvantage lies in that it relies too much on the experience and knowledge of field experts, which is more subjective, and features and relationships may be omitted inevitably; (II) Sample learning: If it is difficult to acquire the experience and knowledge of field experts, a large number of data sets need to be collected, trained and analyzed through network structure algorithms, so as to discover the dependency relationship of the features and construct the Bayesian network structure with the highest degree of fitting to the training set, which is a NP-hard issue [17]. At present, the method based on statistical analysis is mainly used, and the dependency relationship among variables is usually analyzed using statistical or information theory methods to obtain the optimal network structure. In addition, a score search method is usually used, in which, the search strategy and scoring criteria are combined to construct a Bayesian network structure based on the structure space established by the nodes, it has an advantage that a highly fitting network structure to the training sample can be obtained, and its disadvantage lies in that different network structures are produced due to different network structure learning algorithms, and it is also susceptible to noise interference, reducing the generalization ability of the network structure.

Forward analysis and backward reasoning of Bayesian network
Let the root node i X , A Bayesian network structure is established for forward analysis through the relationship of the structure to obtain the probability of occurrence. In the network structure, from top to bottom, the probability distribution of the root node is deduced from the prior probability distribution of the root node and the conditional probability part of the intermediate node, thus obtaining the probability of occurrence of final event results.
Assuming that the root node is in the state , the conditional probability that the leaf node T is in the state is: The above equation is denoted as Equation 2. According to backward reasoning, the posterior probability distribution of each root node is inferred from the bottom of the network structure, that is, the largest possible cause of the event is discovered after the event has occurred.
Assuming that the leaf node T is in the state , the conditional probability distribution of each root node in the state : The above equation is denoted as Equation 3.

Research methods
The malicious risk of Android apps is analyzed and detected to determine the features, set nodes, discover the relationship between the characteristics, and establish a network. The specific steps are as follows:

Bayesian network was established for detection of mobile apps
There have been a range of detailed studies on the permission usage and impact relationship in Android Manifest.xml file in the previous researches on mobile app detection.500 mobile apps were collected in the mobile app sample library provided by the Laboratory Center of TCL Mobile Communication, including 280 malwares and 120 legitimate apps. The malicious behaviors of malwares are generally summarized as: downloading malicious installation packages, malicious sending and receiving text messages, maliciously making calls, malicious advertisements, rootkits, and malicious reading of call records, malicious reading of address book, malicious collection of i X i a i X t T t T i X i a i X user voice, and maliciously stealing geographic location. Malwares may realize malicious behaviors by opening corresponding permissions. The permissions of Android apps were monitored and analyzed by detecting the data of 500 mobile apps, and the usage of malwares and legitimate apps on permissions was analyzed as shown in Figure 1.

Fig. 1 Permission Usage of Android Legitimate Apps and Malwares
As shown in Figure 1, we can find that INTERNET, READ_PHONE_STATE, ACCESS_NETWORK_STATE, WRITE_EXTERNAL_STORAGE, ACCESS_WIFI_STATE, WAKE_LOCK, ACCESS_FINE_LOCATION were used by both legitimate apps and malwares, with a high utilization rate. The usage rate of GET_TASKS, WRITE_SETTINGS, SYSTEM_ALERT_WINDOW, MOUNT_UNMOUNT_FILESYSTEMS, CHANGE_WIFI_STATE, READ_LOGS were more used in legitimate apps than malwares. SEND_SMS, RECEIVE_BOOT_COMPLETED, RECEIVE_SMS, READ_SMS, READ_CONTACTS, INSTALL_PACKAGES were more used in malwares than legitimate apps. The above analysis provided a basis for identifying features and subsequent parameter determination. The impacts of the permission function can be divided into network risk impact, short message risk impact, equipment status risk impact, communication analysis impact, installation package risk impact, and location risk impact according to their effect. According to the authority function and the causal relationship that affects the effect, the Bayesian network structure was established on basis of repeated verification with expert experience and knowledge, as shown in Figure 2,

Build root node parameters
In the sampled data sets of 500 Android app, the statistical data of the status of each root node were output, that is,  Table 2 and  Table 3, and the conditional probability distribution of leaf nodes is shown in Table 4.

Test results and analysis
In order to verify the feasibility and effectiveness of this method, the test was carried out on 100 Android legitimate apps and 200 malwares with data sourced from AndroZoo's public data, covering decompilation, permission setting, data extraction, data structuring, and model verification, the 2 B 3 B results obtained by the research scholars using logistic regression, random forest algorithm and this method were introduced to establish a confusion matrix, and verify the verification index definition respectively from the 4 indicators (accuracy, precision rate, recall, F1 value), as shown in Table 5. Backward reasoning and analysis were carried out to draw conclusion that malwares were most likely to be used with specific permissions. Code detection was performed on the permission-related functions with malicious risks in turn to discover the locations where the malicious codes were most likely embedded, and corresponding protective measures were taken. Meanwhile, permissions with a high risk were included to key monitoring objects of mobile apps. Where, TP refers to the number of actual samples predicted as correct samples, FP refers to the number of incorrect samples predicted as correct samples, FN refers to the number of correct samples predicted as incorrect samples, TN refers to the number of incorrect samples predicted as incorrect samples.

Forward analysis and testing
In equation (2), the security level of the Android app was acquired through analysis combined with the established Bayesian network structure. The following three groups of experiments were conducted using Bayesian network, logistic regression, and random forest, respectively, and the effect was measured from the indicators of accuracy, precision, recall, and F1 value.
(1) The confusion matrixes of three sets of algorithms were established, as shown in Table 6. (2) Accuracy, precision rate, recall, F1 value The comparison of the indicators of the three methods is shown in Table 7. As shown in the table above, the accuracy of the three algorithms has reached more than 80%, and the accuracy of the Bayesian network algorithm is better than the other two algorithms. In order to solve the sample imbalance, the indicators of this method were higher than those in other two methods for the analysis on accuracy, precision, recall, and F1 value. Therefore, the test results show that the prediction model for security monitoring of Android apps designed with Bayesian network performs better and has higher prediction capacity than the traditional prediction method.

Backward reasoning and analysis
The security degree of the known app was analyzed by resorting to the backward reasoning and analysis technology of Bayesian network, the posterior probabilities of each root node in different security states are calculated through equation 3, and the posterior probabilities were sorted in size to analyze the feature node that most likely causes an increase in security risk. Set T=H to analyze the posterior probability distribution of each root node when the security status of the Android application is at high risk, as shown in Figure 3. According to the figure, the startup probability is the highest when and are at high risk, the probability is relatively high when , , and are at high risk, and the probability is slightly high when and are at high risk.

Conclusion
In view of the correlation among app permissions, a detection method of Android malware was put forward based on Bayesian network in this paper. By virtue of the data analysis of Android app permissions, a Bayesian network was quickly established combined with expert experience to visualize the impact of permissions, intuitively and clearly display the path of the security impact of each permission on Android apps. 100 legitimate apps and 200 malwares were determined as samples to test this method, and through testing with multiple detection algorithms, and comparison with other methods in multiple indicators, the feasibility and accuracy of the method are fully proved. By resorting to the backward reasoning technology of Bayesian network, this method can successfully locate the permissions associated with malicious behavior, providing a basis for discovering malicious code.
In the face of the complex network environment and the diversification and complexity of malwares, many uncertain factors may occur to the detection method. The follow-up work mainly falls to the following two points: (1) Strengthening the ability of detection methods. On the basis of this research, obtain the public malware in real time via data crawling technology, update the model, and strengthen the robustness of the model.
(2) Enhancing to trace the source of malicious behavior. Further research should be carried out in terms of tracing to the source of permissions that trigger malicious behaviors, and great efforts should be made to add algorithms such as machine learning, enhance the ability to find and locate, and provide references for tracing malicious code.