Identification and analysis of depression and suicidal tendency of Sina Weibo users based on machine learning

Abstract: In recent years, with the development of big data, artificial intelligence, natural language processing and other technologies, research on automatic mental health assessment driven by social network data has made it much easier to detect depression and identify suicidal tendency. In this study, machine learning and natural language processing were used to identify depression tendency among 203 Sina Weibo users, and the Gradient Boosting algorithm achieved a recognition accuracy of 88.2%. Further, suicidal tendency was identified in 1204 Sina Weibo texts posted by users with depression tendency, where the Gradient Boosting algorithm achieved an accuracy of 82.4%. Suicidal ideation among Sina Weibo users with severe depression tendency was then analyzed using the Beck Scale for Suicide Ideation (BSS). The results showed that most Sina Weibo users with severe depression tendency hide their suicidal intention. Statistical correlation analysis also showed that the frequency of words related to suicidal ideation in the texts of these users was not significantly correlated with suicide intensity or suicide risk, whereas their Beck Depression Inventory (BDI) score was positively correlated with both the suicidal ideation score and suicide risk. This identification and analysis of depression and suicidal tendency among Sina Weibo users can help detect users' depressive and suicidal moods quickly, so that psychological workers and medical staff can carry out early warning and intervention and avoid tragedy.


Introduction
In recent years, mental health problems have become one of the most serious and common public health problems in the world. According to data provided by the World Health Organization, more than 350 million people around the world are affected by depression [1], and 3000 patients with depression commit suicide every day. Suicide caused by psychological problems is one of the three leading causes of death among young people [2]. These data show that mental health has become an important problem affecting social development and personal health. However, public understanding of depression is still limited.
Depression is characterized by significant and lasting low mood. Its outward signs are often hard to distinguish from the behavior of healthy people, so many people who suffer from depression do not know they are ill. Their condition may then deteriorate, and a prolonged unhealthy psychological state has a great impact on patients' lives. Therefore, earlier detection and intervention are of great significance for improving personal quality of life and promoting a healthy social mentality. In recent years, with the development of big data, artificial intelligence, natural language processing and other technologies, more and more scholars have begun to use data science to explore the psychological and behavioral mechanisms of individuals or groups. Furthermore, automatic detection and analysis of the social network data generated by users can help identify their mental health status and support automatic early warning and intervention.
In terms of automatic detection of depression, the language and behavior characteristics of Internet users have been used to detect depression through classification and regression models [3], and a method based on time-frequency analysis of network behavior was later proposed [4]. This research showed that users' depression can be predicted from social network data. In addition to the automatic detection of depression, many scholars have studied early warning of suicidal tendency online, targeting the suicide risk caused by depression. Pete et al. [5] classified social media texts, identified suicidal ideation in them, and analyzed the precursory characteristics of suicidal ideation to help explain the language used by social media users who contemplate suicide. In [6], a deep neural network was used to construct a personalized knowledge graph for suicide, which was then used to determine the key risk factors of personal suicidal ideation. The results showed that the accuracy of suicidal ideation detection based on social media could exceed 93%.
The above research on depression and suicide risk has made great progress, but little research has addressed the suicidal tendency of users with depression. Earlier research has also not integrated the depression of Sina Weibo users with the suicide risk caused by depression, and in-depth analysis of the internal logic and relationship between depression and the suicidal tendency derived from it is lacking. Therefore, this study takes Sina Weibo users as the research object. Firstly, users with depression tendency are detected and identified through machine learning methods. Then each text of these users is taken as an object to identify its suicidal tendency. Finally, the suicidal ideation of users with severe depression tendency is analyzed so as to provide a research basis for further risk early warning and active intervention.

Data acquisition and preprocessing
This study selected Sina Weibo as the data collection platform. Because manual tagging is highly subjective, inaccurate and varies greatly between annotators, the depression score obtained from depression scales was used as the basis for tagging Sina Weibo users. During data collection, the Beck Depression Inventory (BDI) [7] and the Self-rating Depression Scale (SDS) [8] were randomly distributed to a large number of Sina Weibo users to obtain the scale score of each user. According to the scale score, users were divided into normal users and users with depression tendency. Users' texts were collected using Descendant collector with the knowledge and authorization of the users. In total, texts from 102 users with depression tendency and 101 normal users were collected.
Sina Weibo text was preprocessed in Python to remove invisible characters, redundant spaces, punctuation and other invalid content, and regular expressions were used to strip emoticons and similar noise. Jieba was used for Chinese word segmentation of the text, and stop words were then removed.
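The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the regular expressions, the tiny stop-word set and the function names clean_text and tokenize are all assumptions, since the paper does not list the patterns or the stop-word list it used. The segmenter is passed in as a parameter so that the sketch runs without third-party packages; in practice jieba.lcut would be supplied for Chinese word segmentation.

```python
import re
import unicodedata

# Hypothetical patterns; the paper does not list the exact expressions it used.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&_%#-]+")
MENTION_RE = re.compile(r"@[\w-]+")
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")  # bracketed emoticon codes such as [泪]
SYMBOL_RE = re.compile(r"[^\w\s]")             # punctuation, emoji and other symbols
SPACE_RE = re.compile(r"\s+")

# Illustrative stop words only; a real stop-word list would be much longer.
STOP_WORDS = {"的", "了", "是", "我"}


def clean_text(text):
    """Remove invisible characters, URLs, mentions, emoticons and punctuation."""
    # Replace control/format characters (for example zero-width spaces) with a space.
    text = "".join(
        " " if unicodedata.category(ch) in ("Cc", "Cf") else ch for ch in text
    )
    for pattern in (URL_RE, MENTION_RE, EMOTICON_RE, SYMBOL_RE):
        text = pattern.sub(" ", text)
    # Collapse redundant spaces.
    return SPACE_RE.sub(" ", text).strip()


def tokenize(text, segment=str.split, stop_words=STOP_WORDS):
    """Clean the text, segment it, and drop stop words.

    `segment` defaults to whitespace splitting so the sketch has no
    dependencies; pass jieba.lcut to segment Chinese text as in the paper.
    """
    tokens = segment(clean_text(text))
    return [t for t in tokens if t.strip() and t not in stop_words]


if __name__ == "__main__":
    print(tokenize("我 今天\u200b好累…… [泪] http://t.cn/abc @friend"))
```

Keeping the segmenter injectable also makes it easy to compare segmentation tools without touching the cleaning logic.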

Recognition of Sina Weibo users' depression tendency
For feature extraction from Sina Weibo texts, term frequency-inverse document frequency (TF-IDF) [9] was first used as the text feature transformation. Several classifiers, including a neural network, Naive Bayes [10], Random Forest [11], logistic regression, AdaBoost [12] and Gradient Boosting [13], were studied, and their classification performance was compared after fully tuning their parameters. In building the machine learning model, the input was the preprocessed text of each user and the output was whether the user had depression tendency, which is a binary classification problem. To test the performance of the different algorithms, four indicators (accuracy, precision, recall and F1 score) were computed under simple cross validation (75% of the data used as the training set and 25% as the test set), as shown in Table 1. As can be seen from Table 1, Gradient Boosting outperformed the other algorithms. The algorithms were implemented using the Python machine learning package scikit-learn. To analyze whether the differences between algorithms were statistically significant, simple cross validation and 2-fold, 3-fold, 5-fold, 10-fold and 20-fold cross validation were used to measure the recognition accuracy of each algorithm. The results are shown in Table 2.
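To make the TF-IDF feature transformation concrete, the following is a minimal pure-Python sketch of the textbook weighting, tf(t, d) × log(N / df(t)). It is an illustration under stated assumptions rather than the study's implementation: the paper reports using scikit-learn but does not give its vectorizer settings, and scikit-learn's TfidfVectorizer by default applies a smoothed inverse document frequency and L2 normalization, so its values differ from the ones computed here. The function name tfidf and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Compute textbook TF-IDF weights.

    documents: list of token lists (one list per user or per text).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    # Toy token lists, invented for illustration only.
    docs = [["失眠", "难过", "难过"], ["开心", "难过"], ["开心", "旅行"]]
    for vector in tfidf(docs):
        print(vector)
```

A term that occurs in every document receives weight zero under this formula, which is why TF-IDF emphasizes words that distinguish one user's text from another's.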
One-way ANOVA was used to test statistically the accuracies obtained by the different algorithms under the different cross validation methods; the results are shown in Table 3. Table 3 shows a significant difference in accuracy between algorithms across the cross validation methods, that is, the choice of algorithm has a significant impact on recognition accuracy. From the above analysis, it can be concluded that the Gradient Boosting algorithm achieved the best classification accuracy, 88.2%. The confusion matrix of the Gradient Boosting algorithm is shown in Figure 1. It can be seen from Figure 1 that among all 102 text data, 12 text data of subjects with depression tendency were incorrectly identified as text data of normal subjects.
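The one-way ANOVA used here treats each algorithm as a group and the accuracies obtained under the different validation schemes as the observations in that group. A minimal sketch of the F statistic is given below. The function name one_way_anova_f and the accuracy values in the example are invented for illustration and are not the values in Table 2; obtaining a p-value additionally requires the F distribution, which a statistics package such as SciPy (scipy.stats.f_oneway) provides.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of observations per group
    (here: one list of accuracies per algorithm).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical accuracies of three algorithms under six validation schemes.
    accuracies = [
        [0.80, 0.78, 0.81, 0.79, 0.80, 0.82],
        [0.84, 0.83, 0.85, 0.86, 0.84, 0.85],
        [0.88, 0.86, 0.87, 0.88, 0.89, 0.87],
    ]
    print(one_way_anova_f(accuracies))
```

A large F value means that the differences between algorithm means are large relative to the spread of each algorithm's accuracy across validation schemes.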

Identification of suicidal tendency of Sina Weibo users with depression tendency
In addition to the recognition of depression, it is also of great significance to study the suicidal tendency of Sina Weibo users with depression, because depression is often accompanied by significant and lasting low mood, and even by pessimism and suicide attempts or behaviors. Thus, building on the previous analysis, this study further investigated suicidal tendency using natural language processing and machine learning methods. Following [14], each Sina Weibo text posted by users with depression tendency was assigned a suicidal tendency grade, as shown in Table 4. Among all 1204 Sina Weibo texts, 933 were labeled 0, 171 were labeled 1 and 100 were labeled 2.

Data preprocessing
Sina Weibo text was preprocessed in Python to remove invisible characters, redundant spaces, punctuation and other invalid content, and regular expressions were used to strip emoticons and similar noise. Jieba was used for Chinese word segmentation of the text, and stop words were then removed.

Suicidal tendency identification
Because many machine learning classifiers exist and their performance varies between situations, this study compared several classifiers, including a neural network, the k-nearest neighbor algorithm [15], random forest, logistic regression, AdaBoost and Gradient Boosting. In building the machine learning model, the input was each preprocessed Sina Weibo text from users with depression tendency, and the output was the suicidal tendency grade of the text, which is a three-class classification problem. To test the performance of the different algorithms, four indicators (accuracy, precision, recall and F1 score) were computed under simple cross validation after fully tuning the parameters, as shown in Table 5. As can be seen from the table, Gradient Boosting scored highest on all four indicators. The algorithms were implemented using the Python machine learning package scikit-learn. The confusion matrix of the Gradient Boosting algorithm is shown in Figure 2. According to this confusion matrix, of the 478 Sina Weibo texts with tag 0, 10 were incorrectly identified as tag 2 and 3 as tag 1; of the 45 texts with tag 2, 27 were incorrectly identified as tag 0 and 2 as tag 1; and of the 79 texts with tag 1, 59 were incorrectly identified as tag 0 and 5 as tag 2.
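The counts quoted from the confusion matrix are enough to recompute the evaluation indicators by hand. The sketch below rebuilds the matrix from those counts; the diagonal entries are not stated in the text and are derived here by subtracting the misclassified texts from each class total, which is an assumption about how Figure 2 should be read. Under that reading the overall accuracy is 496/602, about 82.4%, in line with the reported figure, while the per-class values show that recall is considerably lower for tags 1 and 2 than for tag 0. The function names are illustrative.

```python
# Rows are true tags 0, 1, 2; columns are predicted tags 0, 1, 2.
# Off-diagonal counts come from the text; diagonal counts are derived
# as class total minus misclassified (478-13, 79-64, 45-29).
CONFUSION = [
    [465, 3, 10],   # true tag 0 (478 texts)
    [59, 15, 5],    # true tag 1 (79 texts)
    [27, 2, 16],    # true tag 2 (45 texts)
]


def accuracy(cm):
    """Fraction of texts on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_metrics(cm):
    """Return {tag: (precision, recall, f1)} computed one-vs-rest."""
    metrics = {}
    for i in range(len(cm)):
        true_positive = cm[i][i]
        actual = sum(cm[i])                     # texts whose true tag is i
        predicted = sum(row[i] for row in cm)   # texts predicted as tag i
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[i] = (precision, recall, f1)
    return metrics


if __name__ == "__main__":
    print(f"accuracy = {accuracy(CONFUSION):.3f}")
    for tag, (p, r, f1) in per_class_metrics(CONFUSION).items():
        print(f"tag {tag}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```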

Analysis of suicidal ideation
Building on the above research on suicidal tendency, this study further explored the suicidal ideation of Sina Weibo users. After the suicidal tendency index was constructed, data on suicidal ideation were collected by issuing the Chinese version of the Beck Scale for Suicide Ideation (BSS) to Sina Weibo users with severe depression tendency. With the knowledge and authorization of these users, 43 valid questionnaires were collected, of which 41 (95.35%) indicated suicidal ideation. The vast majority of Sina Weibo users with severe depression tendency therefore had suicidal ideation, which calls for vigilance.
Because data were missing for a few users, the relevant suicidal ideation data were obtained for 38 Sina Weibo users with severe depression tendency. Among these 38 users with both severe depression tendency and suicidal ideation, 20 (52.63%) had words related to suicidal ideation in their Sina Weibo texts. In response to the BSS item "do you let people know your suicidal thoughts?", only 8% of users said they would frankly and actively disclose their suicidal thoughts in the most recent week, while 22% said they would do so at their most depressed and melancholy time (see Figure 3). This meant that although most Sina Weibo users with severe depression tendency and suicidal ideation would not frankly and actively disclose their suicidal thoughts, some of them expressed on Sina Weibo the suicidal thoughts they hid in real life. This is why using social networks to identify users' suicidal thoughts is worthwhile. On the other hand, although words related to suicidal ideation appeared in the Sina Weibo texts of some users with severe depression tendency, their BSS results showed no suicidal ideation. This may be because these words reflected only a passing thought at the time of posting, which was not enough to constitute suicidal ideation. A correlation analysis was carried out between suicide intensity and the frequency of words related to suicidal ideation in the texts of Sina Weibo users with severe depression tendency, using the Spearman coefficient. As shown in Table 8, the two-sided significance was 0.389, greater than 0.05, so there was no statistically significant correlation between word frequency and suicide intensity.
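The Spearman coefficient used above is the Pearson correlation computed on ranks, with tied values sharing the average of their ranks. The following is a minimal sketch of that calculation. The function names and the sample numbers are hypothetical and are not the study's data; the significance values reported in the tables would come from a statistics package (for example scipy.stats.spearmanr) rather than from this sketch.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented example: suicide-related word counts and BSS intensity scores.
    word_counts = [0, 3, 1, 0, 5, 2, 0, 1]
    intensity_scores = [12, 9, 15, 7, 10, 14, 11, 8]
    print(f"rho = {spearman(word_counts, intensity_scores):.3f}")
```

Because it works on ranks, the coefficient is suited to skewed count data such as word frequencies, where many users have a count of zero.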
A correlation analysis was also carried out between suicide risk and the frequency of words related to suicidal ideation in the texts of Sina Weibo users with severe depression tendency, again using the Spearman coefficient. As shown in Table 9, the two-sided significance was 0.375, greater than 0.05, which meant that there was no statistically significant correlation between word frequency and suicide risk. The correlation between suicidal ideation intensity and the Beck Depression Inventory score was analyzed using the Pearson correlation coefficient. The correlation coefficient was 0.351, significant at the 0.05 level (two-sided), as shown in Table 10. This meant that the depression tendency of Sina Weibo users with severe depression tendency had a certain positive correlation with their suicidal ideation. The correlation between suicide risk and the Beck Depression Inventory score was also analyzed using the Pearson correlation coefficient. The correlation coefficient was 0.429, significant at the 0.01 level (two-sided), as shown in Table 11. This meant that the depression tendency of Sina Weibo users with severe depression tendency had a certain positive correlation with their suicide risk.
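The significance of a Pearson coefficient can be checked with the usual test statistic t = r × sqrt((n − 2) / (1 − r²)), which follows a t distribution with n − 2 degrees of freedom when there is no true correlation. The sketch below applies it to the two reported coefficients. The sample size is an assumption: n = 38 is taken from the number of users with usable suicidal ideation data, but the paper does not state the n behind Tables 10 and 11. The critical values in the comments are approximate figures from standard t tables.

```python
import math


def t_statistic(r, n):
    """t statistic for testing a Pearson correlation r from n paired samples."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Assumed sample size (not stated for these tables in the paper).
N_ASSUMED = 38

# Approximate two-sided critical values of t for 36 degrees of freedom.
CRITICAL_05 = 2.03   # 0.05 level
CRITICAL_01 = 2.72   # 0.01 level

if __name__ == "__main__":
    for label, r in (("BDI vs suicidal ideation", 0.351), ("BDI vs suicide risk", 0.429)):
        t = t_statistic(r, N_ASSUMED)
        print(
            f"{label}: r={r}, t={t:.3f}, "
            f"exceeds 0.05 threshold: {t > CRITICAL_05}, "
            f"exceeds 0.01 threshold: {t > CRITICAL_01}"
        )
```

Under the assumed sample size, the first coefficient clears the 0.05 threshold but not the 0.01 threshold, and the second clears both, which agrees with the significance levels stated in the text.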

Discussion
In this study, Sina Weibo users were taken as the research object, and depression tendency was identified among 203 Sina Weibo users through data mining, natural language processing and machine learning. The Gradient Boosting algorithm achieved a recognition accuracy of 88.2%. Furthermore, for Sina Weibo users with depression tendency, the suicidal tendency of 1204 Sina Weibo texts was identified, and the Gradient Boosting algorithm achieved an accuracy of 82.4%. Several machine learning algorithms were compared. After fully tuning the parameters, Gradient Boosting was found to have the best recognition performance, because as an ensemble algorithm it combines the strengths of multiple learners. In addition, Gradient Boosting has good generalization and representation ability on densely distributed data, and the Sina Weibo text data in this study contained a large number of near-synonyms.
In studying the suicidal tendency and suicidal ideation of depressed Sina Weibo users, this study first explored their suicidal tendency. With the data prepared earlier, machine learning achieved a relatively good recognition rate, which indicates that the text styles of depressed Sina Weibo users differ with their degree of suicidal tendency and that suicidal tendency can be inferred from the text. Building on this, the study analyzed the suicidal ideation of Sina Weibo users with severe depression tendency using the BSS and the frequency of words related to suicidal ideation in their texts. The word frequency was not significantly correlated with suicidal ideation intensity or suicide risk, which meant that it was not feasible to infer suicidal ideation and suicide risk from word frequency alone.
To further explore why the word frequency of suicidal ideation of Sina Weibo users with severe depression tendency was not significantly correlated with their suicide intensity and risk, the BSS data were analyzed in depth. It was found that most users tend to hide their suicidal intention, whether in real life or on social networks. This outcome also indirectly confirmed a characteristic of patients with depression: most of them are inclined to close themselves off, withdraw into their own world and find it hard to extricate themselves. Therefore, future research should use more methods to explore users' suicidal ideation. In this regard, considering that most users do not post content expressing depression or suicide, some researchers have used social network nodes to find depressed users. The researchers in [16] regarded a patient with depression as a node, built a graph network centered on this node, and proposed a model that calculates depression status from the attributes and connection weights of adjacent nodes in the network.
Research on mental health assessment based on social networks also involves many problems, such as data privacy and information disclosure. Data privacy is a continuing concern: once users are labeled as having psychological problems, they may be discriminated against or ridiculed. Therefore, data protection and ownership framework agreements are needed to ensure that users are not harmed.

Conclusion
Based on the machine learning identification and analysis of depression and suicidal tendency of Sina Weibo users, and combining knowledge from psychology, data mining, natural language processing, machine learning and statistics, this study drew the following conclusions. (1) To determine whether Sina Weibo users had depression tendency, users were labeled through depression questionnaires and their Sina Weibo text data were collected. After data preprocessing and feature transformation, and after comparing different algorithms and fully tuning their parameters, the Gradient Boosting algorithm was used to identify depression tendency among Sina Weibo users, with an accuracy of 88.2%. (2) For the suicidal tendency of Sina Weibo users with depression, each Sina Weibo text was taken as the research object and labeled with one of three levels of suicidal tendency. After data preprocessing and feature transformation, different algorithms were compared. The Gradient Boosting algorithm was then used to detect suicidal tendency, and the recognition accuracy reached 82.4% after fully tuning its parameters. (3) There was no significant correlation between the frequency of words related to suicidal ideation in the texts of Sina Weibo users with severe depression tendency and either suicide intensity or suicide risk. There was a significant positive correlation between the BDI score and both the suicidal ideation score and suicide risk. Most Sina Weibo users with severe depression tendency hide their suicidal intention.