Research on flight technology evaluation based on machine learning algorithm

Abstract: Flight safety has long been a central concern of China's civil aviation industry. This paper establishes a flight technology assessment model and an automated early warning model for aviation safety. The data are first pre-processed. Suitable indicators are then screened iteratively using several machine learning classifiers, and the screened data are fitted to refine the indicator set further; ensemble learning classification models are found to be best suited to flight technology assessment. The three best-performing models before optimization are LightGBM, XGBoost and Random Forest. Their results are fused with a Stacking model to combine their advantages and build the final flight technology assessment prediction model. For the automated early warning mechanism, the data are first divided into categories with the K-means clustering model, and key data items such as avg (COG NORM ACCEL) are visualized against the normal distribution. Warning levels for implausible values are then set according to the differing distribution of each category, which establishes the automated aviation early warning model.


Introduction
The recent "March 21" air disaster has raised concerns about flight safety. Flight safety big data, such as Quick Access Recorder (QAR) data, which records aircraft flight parameters during flight, is an important source from which airlines obtain those parameters. At present, flight quality monitoring mainly concerns the study and application of overrun events, but the analysis of these events is inadequate and lacks in-depth investigation of their causes. There is therefore a need to mine QAR data from the full flight segment to build flight quality records for individual personnel, carry out targeted safety management, identify safety hazards, and improve safety performance through data modeling, analysis, calculation and assessment of risk propensity.
Flight quality monitoring is an important task for ensuring the stable and sustainable safety of the civil aviation industry, and it plays an important role in preventing flight safety accidents. The Quick Access Recorder (QAR) provides comprehensive data for flight risk studies by recording parameters such as position, flight attitude and flight operation during flight. The process of aligning an aircraft with the runway during the landing phase is called an approach, and an unstable approach can easily lead to serious events such as hard landings, tail strikes and runway excursions. Risk analysis and accurate warning of unstable approach events can therefore effectively safeguard aircraft landings. For this reason, this paper establishes a flight technology assessment model and an automated early warning model for aviation safety [1].

Data pre-processing
(1) Disposal of redundant indicators
Some indicators in the data, such as aircraft model and date, have no practical significance for model building, so they are deleted directly.
(2) Feature encoding
The raw data contain categorical features whose values span a wide range and take multiple forms, whereas only numeric types can be used in computation. These feature values therefore need to be encoded. Here we use label encoding to map the original feature values to custom numeric labels and complete the quantization. For V2_Method, for example, we encode R as 1 and C as 0.
(3) Missing value handling
Visualizing the missing values with Python shows that there are only 2369 data records in total. The null heat map of the aircraft parameter measurement data is shown in Figure 1. For indicators with missing or abnormal values, we fill them using the K-nearest neighbors (KNN) method.
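The encoding and filling steps above can be sketched as follows. This is a minimal illustration and not the paper's actual code: the helper names `label_encode` and `knn_impute`, the toy rows, and the choice to fill a gap with the mean of the k nearest complete rows are all assumptions made for the example.

```python
import math

def label_encode(values, mapping):
    """Map categorical values to numeric labels (e.g. R -> 1, C -> 0)."""
    return [mapping[v] for v in values]

def knn_impute(rows, k=2):
    """Fill None entries with the column mean of the k nearest complete rows.

    Distance is Euclidean, computed only over the columns that are
    present in the incomplete row (an assumption of this sketch).
    """
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        known = [j for j, v in enumerate(row) if v is not None]
        # Rank the complete rows by distance on the known columns.
        neighbours = sorted(
            complete,
            key=lambda c: math.sqrt(sum((c[j] - row[j]) ** 2 for j in known)),
        )[:k]
        filled.append([
            v if v is not None else sum(n[j] for n in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ])
    return filled

# Hypothetical toy data: V2_Method codes and two numeric indicators.
v2_method = label_encode(["R", "C", "R"], {"R": 1, "C": 0})
rows = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
imputed = knn_impute(rows, k=2)
```

In this toy case the incomplete row is closest to the first two rows, so its missing entry is filled with the mean of their second column.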

Benchmark model building and solving
After missing value processing, outlier processing and the other pre-processing steps, we obtained a new copy of the data. To compute the model results, we screened the main indicators repeatedly using the feature importances of multiple machine learning models, following steps similar to those of the first question. The screened data were then fitted with ensemble models, linear models and other machine learning algorithms under 10-fold cross-validation to obtain the relationship between each feature and flight technology. The three best models before optimization were LightGBM, XGBoost and random forest. These models were optimized further by tuning their parameters and changing the indicators, and the results of LightGBM, XGBoost and random forest were then fused with a Stacking model to combine their advantages and build the final flight technology assessment prediction model. The optimal model is selected as the baseline model, and optimization continues from that baseline. The models are built with Python 3.9+ and scikit-learn, and their performance is measured by Accuracy, Recall, Precision and F1, which are calculated as follows.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

The symbols have the following meanings.
TP (True Positive): an instance is a positive class and is also determined to be a positive class.
FN (False Negative): an instance is a positive class but is determined to be a negative class.
FP (False Positive): an instance is a negative class but is determined to be a positive class.
TN (True Negative): an instance is a negative class and is also determined to be a negative class.
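As a small worked illustration of the evaluation setup described above, the sketch below computes the four metrics from confusion-matrix counts and generates 10-fold train/test index splits. The function names and the toy label vectors are assumptions introduced for this example; in practice the equivalent scikit-learn utilities would normally be used.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, Recall, Precision and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Hypothetical labels and predictions for eight flights.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
folds = list(k_fold_indices(25, k=10))
```

Each model would be fitted on the `train` indices of a fold and scored on its `test` indices, and the per-fold metrics averaged.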
Because F1 is a composite metric, only the Accuracy and F1 values are reported below. Table 1 shows these metrics for the three flight technology assessment models.

Random Forest Classification Model
Random forest is an ensemble learning model, i.e., it builds multiple learners that accomplish the learning task together. A set of base learners is generated first, and a combination strategy is then used to merge them. Decision trees are often used as base learners: each tree is a weak learner on its own, but after combination the ensemble often has strong predictive power and becomes a strong learner. Random forest is one such decision tree ensemble. It uses CART decision trees and is widely used because of its good prediction performance and stability [2].
The modeling process of the random forest classification algorithm is as follows:
(1) Draw training sets from the original sample set. In each round, n training samples are drawn from the original sample set with replacement using the bootstrap method. A total of k rounds are performed to obtain k training sets, which are independent of each other.
(2) One model is trained on each training set, giving k models from the k training sets.
(3) For a classification problem, the k models obtained in the previous step vote to produce the classification result.
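The three steps above can be sketched in a few lines. This is a simplified illustration rather than a full random forest: one-split decision stumps stand in for CART trees, the per-split random feature selection of a real random forest is omitted, and all function names and the toy data are assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Step 1: draw len(X) samples with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_stump(X, y):
    """Fit a one-split stump: best (feature, threshold) by training accuracy."""
    majority = Counter(y).most_common(1)[0][0]
    best_correct, best_model = -1, None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            left_lab = Counter(left).most_common(1)[0][0]
            right_lab = Counter(right).most_common(1)[0][0] if right else majority
            correct = (sum(lab == left_lab for lab in left)
                       + sum(lab == right_lab for lab in right))
            if correct > best_correct:
                best_correct, best_model = correct, (j, t, left_lab, right_lab)
    return best_model

def predict_stump(model, row):
    j, t, left_lab, right_lab = model
    return left_lab if row[j] <= t else right_lab

def fit_bagging(X, y, k=25, seed=0):
    """Step 2: train one model per bootstrap training set."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(X, y, rng)) for _ in range(k)]

def predict_vote(models, row):
    """Step 3: majority vote over the k models."""
    return Counter(predict_stump(m, row) for m in models).most_common(1)[0][0]

# Hypothetical one-feature toy data with two classes.
X = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_bagging(X, y, k=25, seed=0)
low, high = predict_vote(models, [0.05]), predict_vote(models, [0.95])
```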
The random forest optimized metrics are shown in Table 2.

Model Building
When training the model, the XGBoost algorithm [4] grows each tree by computing the gain obtained from splitting a node, which decides whether the node is split, while the depth of the tree is controlled through parameters. After a decision tree is generated, it is pruned to prevent overfitting. The tree generated in round m learns the residuals between the true values and the predictions of round m-1, so that the model prediction gradually approaches the true value.
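The split decision described above can be illustrated with the standard XGBoost gain formula, which compares the regularized scores of the two children with that of the parent. In the sketch below, G and H are the sums of first-order and second-order gradients of the samples in a node; the function name, the default values of `reg_lambda` and `gamma`, and the toy numbers are assumptions chosen for the example.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into left and right children.

    gain = 1/2 * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda)
                  - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Toy node with squared-error loss, where each sample has hessian 1:
# four samples with gradient -1 go left, four with gradient +1 go right.
gain = split_gain(-4.0, 4.0, 4.0, 4.0)
penalized = split_gain(-4.0, 4.0, 4.0, 4.0, gamma=5.0)
```

A node is split only when the gain is positive, so a larger `gamma` (the complexity penalty per added leaf) makes the same candidate split unattractive.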
XGBoost has several benefits:
(1) A regularization term is added to the objective function to reduce the possibility of overfitting. In addition to the first-order derivative, the second-order derivative is used, which makes the approximation of the loss function more accurate and allows custom loss functions.
(2) Parallel optimization is possible; XGBoost parallelizes at the feature level.
(3) Sparse values are handled explicitly: a default branching direction can be set for missing or specified values, which greatly improves the efficiency of the algorithm.
(4) Column sampling is allowed, which suppresses overfitting and reduces computation at the same time.

Parameter tuning and solving of the model
We first call the Python package of XGBoost to fit the flight technology assessment data, and then tune the parameters of XGBoost. The tuning is divided into three steps:
Step 1: We first construct the XGBoost model with the default value of each parameter and calculate the evaluation metrics of this initial model.
Step 2: The parameters learning_rate, n_estimators and max_depth are tuned. learning_rate is the learning rate (default 0.3), which controls the step size of each iteration and helps suppress overfitting in the classification task; n_estimators is the number of boosting iterations (the number of weak classifiers); max_depth is the maximum depth of a tree, which is usually limited to avoid overfitting. The regularization parameters lambda and alpha are then tuned; they reduce the complexity of the model and thus improve its performance.
Step 3: The optimal model is constructed with the best parameter combination selected in Step 2, and its evaluation metrics are calculated. The fit of the optimized model is significantly better than that of the model before optimization.
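The three steps above amount to a baseline evaluation followed by a grid search. The sketch below shows that control flow only. `toy_score` is a purely hypothetical stand-in for the cross-validated score of a trained model, and the grid values are assumptions; in a real run `score_fn` would train XGBoost with the given parameters and return, for example, its mean 10-fold F1.

```python
from itertools import product

def grid_search(score_fn, param_grid, defaults):
    """Step 1: score the defaults. Step 2: scan the grid. Step 3: keep the best."""
    baseline = score_fn(**defaults)
    names = list(param_grid)
    best_params, best_score = dict(defaults), baseline
    for values in product(*(param_grid[n] for n in names)):
        params = {**defaults, **dict(zip(names, values))}
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return baseline, best_params, best_score

def toy_score(learning_rate, n_estimators, max_depth):
    # Hypothetical surrogate that peaks at (0.1, 200, 4); NOT a real model score.
    return (1.0 - abs(learning_rate - 0.1)
            - abs(n_estimators - 200) / 1000
            - abs(max_depth - 4) / 10)

defaults = {"learning_rate": 0.3, "n_estimators": 100, "max_depth": 6}
param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 200, 400],
    "max_depth": [3, 4, 6],
}
baseline, best_params, best_score = grid_search(toy_score, param_grid, defaults)
```

Comparing `baseline` with `best_score` corresponds to the before-and-after comparison made in Step 3.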
The calculated optimized metrics are shown in Table 3 below.

Model Building
LightGBM is an efficient implementation of gradient boosting decision trees (GBDT). Its idea is to discretize continuous floating-point features into k discrete values and construct a histogram of width k. The training data are then traversed once to accumulate the statistics of each discrete value in the histogram. When searching for a split, only the k discrete values of the histogram need to be traversed to find the optimal split point. In addition, the use of a leaf-wise growth strategy with a depth limit saves a great deal of time and memory.
LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, regression and many other machine learning tasks. The core ideas of the LightGBM algorithm are the histogram algorithm, the leaf-wise splitting strategy, direct support for categorical features, and histogram-based feature optimization [3].
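The histogram idea can be sketched for a single feature as follows. This is an illustration under simplifying assumptions: equal-width bins are used for readability (LightGBM's own binning is more elaborate), the gain omits constant factors and the leaf penalty, and the function names and toy data are introduced only for this example.

```python
def build_histogram(values, grads, hess, n_bins):
    """One pass over the data: accumulate gradient statistics per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    hist = [[0.0, 0.0] for _ in range(n_bins)]
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        hist[b][0] += g
        hist[b][1] += h
    return hist, lo, width

def best_histogram_split(hist, lo, width, reg_lambda=1.0):
    """Scan bin boundaries only (not raw samples) for the best split."""
    g_total = sum(g for g, _ in hist)
    h_total = sum(h for _, h in hist)
    parent = g_total ** 2 / (h_total + reg_lambda)
    best_gain, best_threshold = 0.0, None
    g_left = h_left = 0.0
    for b in range(len(hist) - 1):
        g_left += hist[b][0]
        h_left += hist[b][1]
        g_right, h_right = g_total - g_left, h_total - h_left
        if h_left == 0 or h_right == 0:
            continue  # an empty side is not a valid split
        gain = (g_left ** 2 / (h_left + reg_lambda)
                + g_right ** 2 / (h_right + reg_lambda) - parent)
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain

# Toy feature: small values have gradient -1, large values +1, hessian 1.
values = [1, 2, 3, 4, 6, 7, 8, 9]
grads = [-1.0] * 4 + [1.0] * 4
hess = [1.0] * 8
hist, lo, width = build_histogram(values, grads, hess, n_bins=4)
threshold, gain = best_histogram_split(hist, lo, width)
```

With four bins, only three candidate boundaries are examined instead of every distinct feature value, which is the source of the time and memory savings.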

Parameter tuning and solving of the model
The LightGBM model has more parameters, but its tuning rules are similar to those of other decision-tree-based models. We first set the initial learning rate to 0.1, which lets the model converge faster. We then determine the number of decision trees and the maximum tree depth by grid search, next determine the maximum number of leaf nodes, and finally adjust the minimum number of samples per leaf node to prevent overfitting.
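The staged order described above (fix the learning rate, then search tree count and depth, then leaf count, then minimum leaf samples) can be sketched as a sequence of small searches. The parameter names follow LightGBM's scikit-learn interface, but the grid values and `toy_score` are hypothetical; a real `score_fn` would train the model and return a cross-validated metric.

```python
from itertools import product

def staged_search(score_fn, stages, start):
    """Tune one group of parameters per stage, keeping earlier choices fixed."""
    params = dict(start)
    for grid in stages:
        names = list(grid)
        candidates = [dict(zip(names, combo)) for combo in product(*grid.values())]
        best = max(candidates, key=lambda c: score_fn(**{**params, **c}))
        params.update(best)
    return params

def toy_score(learning_rate, n_estimators, max_depth, num_leaves, min_child_samples):
    # Hypothetical surrogate score; NOT the output of a real model.
    return -(abs(learning_rate - 0.1)
             + abs(n_estimators - 200) / 100
             + abs(max_depth - 5)
             + abs(num_leaves - 31) / 10
             + abs(min_child_samples - 40) / 10)

start = {"learning_rate": 0.1, "n_estimators": 100, "max_depth": 5,
         "num_leaves": 31, "min_child_samples": 20}
stages = [
    {"n_estimators": [100, 200, 400], "max_depth": [3, 5, 7]},  # grid search
    {"num_leaves": [15, 31, 63]},
    {"min_child_samples": [10, 20, 40]},
]
tuned = staged_search(toy_score, stages, start)
```

Searching in stages evaluates far fewer combinations than a full grid over all four parameters, at the cost of possibly missing interactions between stages.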
The optimized metrics are shown in Table 4.

Model fusion solving
Stacking [5] uses the initial training data to train several base learners and then uses the predictions of these learners as a new training set to train a further learner. In this paper, we trained three base learners, Random Forest, LightGBM and XGBoost, and their outputs are used as the training set for the secondary learner. Because the base learners are strong learners, we choose a simple classification model as the secondary learner to avoid overfitting. Since stacking combines the learning effects of different learners, it can make the final fused model more stable and perform better. Judging from the training effect, the fused model is better than any single strong learner. The fits of Random Forest, LightGBM, XGBoost and the fusion of all three are shown in Table 5.
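The stacking procedure described above can be sketched with out-of-fold predictions. To keep the example self-contained, simple threshold rules stand in for Random Forest, LightGBM and XGBoost, and a lookup table stands in for the simple secondary classifier; these substitutions, the function names and the toy data are all assumptions of this sketch.

```python
from collections import Counter, defaultdict
from statistics import mean

def threshold_learner(feature):
    """Base learner factory: split one feature at the midpoint of class means."""
    def fit(X, y):
        m0 = mean(row[feature] for row, lab in zip(X, y) if lab == 0)
        m1 = mean(row[feature] for row, lab in zip(X, y) if lab == 1)
        t, high = (m0 + m1) / 2, (1 if m1 >= m0 else 0)
        return lambda row: high if row[feature] > t else 1 - high
    return fit

def out_of_fold_predictions(fits, X, y, k):
    """Predict each sample with base models that never saw it in training."""
    n = len(X)
    meta = [[None] * len(fits) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        for m, fit in enumerate(fits):
            predict = fit([X[i] for i in train], [y[i] for i in train])
            for i in test:
                meta[i][m] = predict(X[i])
    return meta

def fit_lookup_meta(meta, y):
    """Secondary learner: majority label for each pattern of base predictions."""
    votes = defaultdict(Counter)
    for row, lab in zip(meta, y):
        votes[tuple(row)][lab] += 1
    default = Counter(y).most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return lambda row: table.get(tuple(row), default)

def fit_stacking(fits, X, y, k=4):
    meta_predict = fit_lookup_meta(out_of_fold_predictions(fits, X, y, k), y)
    base = [fit(X, y) for fit in fits]  # refit base learners on all data
    return lambda row: meta_predict([p(row) for p in base])

# Hypothetical toy data: features 0 and 1 are informative, feature 2 is noisy.
X = [[1, 2, 3], [6, 7, 7], [2, 1, 8], [7, 8, 2],
     [3, 3, 2], [8, 6, 9], [4, 2, 1], [9, 9, 8]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
model = fit_stacking([threshold_learner(j) for j in range(3)], X, y, k=4)
pred_a, pred_b = model([2, 2, 9]), model([8, 8, 1])
```

Training the secondary learner on out-of-fold predictions, rather than on predictions for data the base learners were fitted to, is what keeps the second stage from simply memorizing overfitted base outputs.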

K-means clustering modeling
The K-means algorithm, proposed by J. B. MacQueen in 1967, remains one of the most influential clustering algorithms used in scientific and industrial applications.
① For the set of data objects, K objects are arbitrarily selected as the initial class centers.
② Each object is reassigned to the most similar class based on the mean of the objects in the class.
③ The class means are updated, i.e., the average value of the objects in each class is recalculated.
④ Steps ② and ③ are repeated until no more changes occur.
Using Python together with SPSSpro, the above feature variables were divided into five categories in total, and the clustering visualization is shown in Figure 2.
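Steps ① to ④ can be sketched directly. In this minimal example the "arbitrary" initial centers are simply the first K objects, which is an assumption made to keep the run deterministic; the function name and the two-cluster toy data are likewise illustrative (the paper itself uses five categories).

```python
def k_means(points, k, max_iter=100):
    """Plain K-means on tuples of numbers; returns (labels, centers)."""
    centers = [tuple(p) for p in points[:k]]  # step 1: initial class centers
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every object to the nearest class center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # step 4: stop when nothing changes
            break
        labels = new_labels
        # Step 3: update each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a class becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical two-dimensional toy data with two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, k=2)
```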

Figure 2: Aircraft parameter measurement data null heat map.
It can be seen that avg (COG NORM ACCEL), PITCH ATT RATE, avgROLL ATT, PITCH ATT RATE, Inertial Vertical Speed are divided into 5 categories, and category 1 has the largest weight. We visualized the normal distribution of each of these five features; only the visualizations for avg COG NORM ACCEL and avgROLL ATT are shown in Figure 3.
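One way to turn the per-category normal distributions into warning levels is to flag values that fall outside a band around each category's mean. The sketch below is only an illustration of that idea: the two-standard-deviation band, the function names and the sample values are assumptions and are not taken from the paper.

```python
from statistics import mean, stdev

def warning_limits(values, labels, n_sigma=2.0):
    """Per-category (low, high) limits at mean +/- n_sigma standard deviations."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    return {
        lab: (mean(vs) - n_sigma * stdev(vs), mean(vs) + n_sigma * stdev(vs))
        for lab, vs in groups.items()
    }

def is_warning(value, label, limits):
    """True when the value lies outside the band of its own category."""
    low, high = limits[label]
    return value < low or value > high

# Hypothetical avg (COG NORM ACCEL) readings for two clustering categories.
values = [1.00, 1.02, 0.98, 1.01, 0.99, 1.20, 1.25, 1.15, 1.22, 1.18]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
limits = warning_limits(values, labels)
```

Because the limits are computed per category, the same reading can be normal in one category and trigger a warning in another.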

Table 1: Metrics for the three models of the flight technology assessment.