Prediction of Alzheimer's Disease Based on Random Forest Model

Abstract: Alzheimer's disease is a syndrome characterized by acquired cognitive impairment, leading to significant declines in daily life, learning, work, and social functioning. It has a profound impact on the lives of elderly people, making early detection and treatment an urgent issue. This paper collects relevant data from patients with Alzheimer's disease in a certain hospital and, after preprocessing, explores the data using histograms, probability density plots, box plots, and a correlation coefficient heat map. It then compares the accuracy of a logistic regression classification model, a random forest classification model, and a REF-random forest model in predicting Alzheimer's disease categories. The results show that the REF-random forest model achieves the highest prediction accuracy. Finally, this paper uses the SMOTE algorithm to process the data and further improve the accuracy of the model. The optimized REF-random forest model achieves outstanding results on all indicators.


Introduction
Alzheimer's disease is a syndrome characterized by acquired cognitive impairment, leading to significant declines in daily life, learning ability, work ability, and social interaction skills. The cognitive impairment involves memory, learning, orientation, understanding, judgment, calculation, language, visual-spatial functioning, analysis, and problem-solving abilities. It is often accompanied by mental, behavioral, and personality abnormalities at a certain stage of the disease course. Its etiology may be related to genetic factors, unhealthy lifestyles, cerebrovascular disease, hyperlipidemia, and other factors [1]. As the population ages, Alzheimer's disease has seriously affected the lives of the elderly. It is therefore urgent to detect and treat Alzheimer's disease patients as early as possible. Before this paper, Fan Yu [2] used multiple machine learning models for prediction but did not optimize or improve them. This paper uses a doubly optimized random forest model to make these predictions.

Data Preprocessing and Exploration
The data, provided by the ADNI database (Alzheimer's Disease Neuroimaging Initiative), cover participants aged 55-90 who are able to provide independent functional assessments and have been screened for specific psychoactive drugs. The data include RID, EXAMDATE, DX_bl, AGE, PTGENDER, PTEDUCAT, and PTETHCAT, among other variables.

Data Preprocessing
In this study, the proportion of missing values in each variable was examined first. Variables with a missing-value proportion greater than 0.5 were removed, as they had no research significance. The remaining missing values were then handled by replacing numerical values with the variable's mean and storing non-numerical values in an "append" column. For variables with a missing-value proportion greater than 0.1, an indicator was added: a missing entry was labeled 0 and a present entry was labeled 1. Next, the classes CN, AD, LMCI, SMC, and EMCI of the categorical variable were assigned the numbers 0 to 4 for further analysis. This completed the data processing.
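The preprocessing steps above can be sketched with pandas as follows. This is a minimal illustration, not the authors' code: the function name `preprocess`, the `_present` suffix for the indicator columns, and the toy column `MOSTLY_MISSING` are assumptions, and the handling of non-numerical values (the "append" column) is left out because the text does not specify it in enough detail.

```python
import numpy as np
import pandas as pd

def preprocess(df, label_col="DX_bl", drop_thresh=0.5, flag_thresh=0.1):
    """Sketch of the missing-value handling and label encoding described above."""
    df = df.copy()
    missing = df.isna().mean()  # fraction of missing values per column

    # Drop variables whose missing-value proportion exceeds 0.5.
    df = df.drop(columns=missing[missing > drop_thresh].index)

    # For variables with more than 10% missing, add a 0/1 indicator
    # (0 = value was missing, 1 = value was present).
    flagged = missing[(missing > flag_thresh) & (missing <= drop_thresh)].index
    for col in flagged:
        if col != label_col:
            df[col + "_present"] = df[col].notna().astype(int)

    # Replace remaining missing numerical values with the column mean.
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

    # Encode the diagnosis classes as 0..4.
    classes = {"CN": 0, "AD": 1, "LMCI": 2, "SMC": 3, "EMCI": 4}
    df[label_col] = df[label_col].map(classes)
    return df

# Toy data (hypothetical values, not taken from the study).
demo = pd.DataFrame({
    "AGE": [70.0, np.nan, 80.0, 75.0, 65.0],             # 20% missing
    "MOSTLY_MISSING": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80% missing
    "DX_bl": ["CN", "AD", "LMCI", "SMC", "EMCI"],
})
clean = preprocess(demo)
```

In this toy run the mostly empty column is dropped, the missing age is replaced by the mean of the observed ages, and an `AGE_present` indicator records which rows were originally observed.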

Data Exploration
To explore the processed data, this study plotted histograms, probability density plots, and box plots to visualize the distribution of the data. Additionally, a heatmap of Spearman correlation coefficients was created to examine the correlation between variables. Due to the limited space in this paper, only some of the data are presented. Through the histogram (Figure 1) and density graph (Figure 2), we can observe that age follows a normal distribution, indicating that the selected data have a certain generalizability. However, the other variables do not conform to the normal distribution. The box plot (Figure 3) reveals some outliers in some variables, but since the sample size is relatively large, they can be ignored. The heatmap (Figure 4) shows that some variables are strongly correlated, but most are not significantly correlated.
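As a small illustration of the correlation step, the Spearman coefficient matrix behind such a heatmap can be computed with pandas. The column names `SCORE_A` and `SCORE_B` and all values are made up for the example; only the use of Spearman's coefficient comes from the text.

```python
import pandas as pd

# Toy data: SCORE_A increases monotonically with AGE, SCORE_B decreases.
demo_corr = pd.DataFrame({
    "AGE": [60, 65, 70, 75, 80],
    "SCORE_A": [1, 4, 9, 16, 25],
    "SCORE_B": [5, 4, 3, 2, 1],
})

# Spearman's coefficient is the Pearson correlation of the ranks, so any
# strictly monotonic relationship gives +1 or -1 even when it is nonlinear.
corr = demo_corr.corr(method="spearman")
print(corr)
```

The resulting square matrix is what a plotting library would render as the heatmap; because the coefficient is rank-based, it does not require the variables to be normally distributed.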

Introduction to the Base Models
In this study, two base models were chosen for multi-class classification: Logistic Regression and Random Forest. The Logistic Regression model uses a one-vs-rest approach, treating each class in turn as a binary problem against all remaining classes. For N classes, N binary classifications are performed, resulting in N binary models. Each binary model yields a probability for its class, and the class with the highest probability is used as the predicted result for a new sample.
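A one-vs-rest logistic regression of this kind can be sketched with scikit-learn. The synthetic five-class dataset and all hyperparameters below are assumptions made for illustration; the paper does not state its settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the five diagnosis classes (labels 0..4).
X_ovr, y_ovr = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# One binary logistic regression is fitted per class (that class vs. the rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_ovr, y_ovr)

proba = ovr.predict_proba(X_ovr[:3])  # one column of scores per class
pred = ovr.predict(X_ovr[:3])         # class with the highest score
print(len(ovr.estimators_), proba.shape, pred)
```

One-vs-rest fits one binary classifier per class, so `ovr.estimators_` holds five models here, and the prediction is the class whose binary model is most confident.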
The Random Forest model is an ensemble learning method that builds multiple decision trees from random samples of the original dataset. In this study, we used bootstrapping to randomly draw n training samples, with replacement, from the original dataset in each iteration. A total of k iterations are performed, resulting in k mutually independent training sets. Each training set is used to train a single decision tree model. Finally, for the multi-class problem, these k decision trees classify a sample by majority voting [4][5][6].
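The bootstrap-and-vote procedure can be written out by hand to make the mechanics visible. This is a simplified sketch under assumed settings (k = 25 trees, synthetic data, helper names `fit_bagged_trees` and `predict_majority` are hypothetical); in practice `sklearn.ensemble.RandomForestClassifier` does this internally.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, k=25, seed=0):
    """Train k decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # n draws with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # random feature subset at each split
            random_state=int(rng.integers(1 << 31)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Each tree votes; the most common class per sample wins."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (k, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

X_bag, y_bag = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
trees = fit_bagged_trees(X_bag, y_bag)
preds = predict_majority(trees, X_bag)
print("training accuracy:", (preds == y_bag).mean())
```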
To keep irrelevant or weakly correlated variables from reducing model accuracy, we employed a REF model based on the Random Forest model to select the most relevant indicators. The nine indicators with the highest correlation were selected and used as input features for the Random Forest model for prediction.
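One way to realize this selection step is recursive feature elimination, assuming that "REF" refers to the technique scikit-learn exposes as `RFE`. The sketch below uses synthetic data and assumed tree counts; note that `RFE` with a random forest ranks features by the forest's importance scores rather than by a correlation coefficient, so it may differ from the authors' exact criterion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in: 20 candidate features, 9 of them informative.
X_rfe, y_rfe = make_classification(
    n_samples=300, n_features=20, n_informative=9, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# Repeatedly fit a forest and drop the least important feature
# until nine features remain.
selector = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=9,
    step=1,
)
selector.fit(X_rfe, y_rfe)
X_selected = selector.transform(X_rfe)

# Train the final random forest on the selected features only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_selected, y_rfe)
print("kept feature indices:", selector.get_support(indices=True))
```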

Model Solution
In this study, the processed data were used to train both the Logistic Regression and Random Forest models. The results are presented in Tables 2 and 3.
From Table 5 and Figure 5, we can observe that the accuracy and recall of the REF-Random Forest model increased from approximately 90% to 100% after the data were processed with the SMOTE algorithm. The model classifies the data accurately, indicating that this optimization was very successful. We then plotted the ROC curve for each category before and after optimization, as shown in Figures 6 and 7. The results were already good before optimization, and after optimization the curves are perfect.
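The core idea of SMOTE is to create synthetic minority-class samples by interpolating between a real sample and one of its nearest neighbours in the same class. The function below is a minimal NumPy sketch of that idea, not the implementation used in the paper; the name `smote_oversample`, the neighbour count k = 5, and the toy data are assumptions.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic samples from the minority-class rows X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest per point

    # Pick a random base point and one of its k neighbours, then place
    # the synthetic point at a random position on the segment between them.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 20 samples with 3 features.
X_minority = np.random.default_rng(1).normal(size=(20, 3))
synthetic = smote_oversample(X_minority, n_new=30)
print(synthetic.shape)
```

Because each synthetic point lies between two real points, oversampling is usually applied to the training split only, so that synthetic samples derived from test data do not influence the evaluation. The imbalanced-learn package provides a full `SMOTE` implementation.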

Conclusion
This paper concludes that factors significantly affecting Alzheimer's disease include AGE, EcogSPPlan_bl, LDELTOTAL, EcogPtPlan_bl, EcogPtOrgan_bl, TRABSCOR, and EcogSPOrgan_bl. By comparing the accuracy and recall of the REF model with those of the logistic regression and random forest models, this study finds that the REF-random forest model has the best predictive performance. After the SMOTE algorithm is used to optimize the REF-random forest model, both recall and accuracy reach 100%. As shown in Figure 5, the confusion matrix also indicates that the model classifies each category accurately. Hospital examination reports can be imported into this model to predict which category a patient belongs to (CN, AD, LMCI, SMC, or EMCI), thus enabling early detection and timely treatment. This can help delay disease progression, improve quality of life, and reduce the burden on both individuals and society as a whole.

Figure 4: Heatmap of correlation coefficients.

Figure 5: Confusion matrix of the REF-Random Forest model after optimization with SMOTE.

Figure 6: ROC curve of the REF-Random Forest model.

Table 1: Partial view of the data. MCI: Mild Cognitive Impairment. MCI participants maintain daily activities and show no significant impairment in other cognitive domains and no evidence of dementia. The MCI level (early or late) is determined using the Wechsler Memory Scale Logical Memory II.

Table 2: Logistic Regression Classification Evaluation Report.

Table 3: Random Forest Classification Evaluation Report.