Classification Technology of GC-MS Map Data of Baijiu Based on Sparse Principal Components

Abstract: To achieve accurate identification of Baijiu GC-MS map data, sparse principal component analysis (SPCA) is applied to the data by introducing the elastic net penalty function and ridge regression to constrain the principal components on the basis of principal component analysis (PCA). The sparse principal components are fed into different classifiers for classification and identification, and a Baijiu quality classification model is established. Comparison experiments showed that the sparse principal components better represent the information of Baijiu with different characteristics and yield higher classification accuracy: the recognition rates of SPCA+KNN, SPCA+DT, SPCA+SVM, and SPCA+BP reached 62%, 89%, 97%, and 100%, respectively. The differences among the sparse principal components of the GC-MS profiles of different grades of Baijiu were greater than the differences among the ordinary principal components, and the relationship among the sparse principal components of the GC-MS profiles was nonlinear. The established Baijiu quality evaluation model based on sparse principal components can effectively evaluate Baijiu grades, which provides a more effective and objective method for Baijiu quality control and grade identification.


Introduction
The acids, alcohols, esters, aldehydes, ketones and other trace components in Baijiu make up only about 2% of its composition, but they have a great impact on its flavor. For a long time, Baijiu characteristics have been evaluated mainly by sensory evaluation, which is inevitably affected by human factors. Therefore, how to use scientific methods to improve the objectivity and accuracy of Baijiu characteristics evaluation is an urgent problem to be solved [1].
At present, methods for evaluating Baijiu characteristics mainly obtain Baijiu map data through sensor detection technology and then use feature extraction and recognition technology to classify Baijiu characteristics. Common methods include electronic nose mass spectrometry (EN-MS) [2,3], gas chromatography-mass spectrometry (GC-MS) [4,5], inductively coupled plasma (ICP), fluorescence spectroscopy [6,7], and the colorimetric artificial nose method. Among them, GC-MS combines the separation ability of chromatography with the qualitative ability of mass spectrometry, so it can perform qualitative analysis of multi-component mixtures in a relatively short time and effectively reflect the trace components. It is the most common method in the evaluation of Baijiu characteristics. Li Xinfeng et al. [8] used gas chromatography to evaluate the quality grade of Luzhou-flavor Baijiu and to find micro-level relationships between Baijiu ingredients, providing new ideas for the micro-level research of Baijiu. Zhang Qi et al. [9] used GC-MS to obtain 32 characteristic peaks contained in several types of Luzhou-flavor Baijiu; the content and proportion of the flavor components represented by these characteristic peaks significantly affected the flavor of Baijiu. Xun Siying et al. [10] established a common peak map model for Baijiu by combining GC-MS determination with map analysis software.
Principal component analysis (PCA) is mainly used to reduce the dimension of GC-MS-based Baijiu map data and extract its main features [11]. PCA uses several orthogonal principal components to represent the information in Baijiu map data, achieving dimension reduction while retaining as much of the data information as possible [12,13]. In recent years, with the improvement of sparse representation theory, sparse representation technology has gradually been applied to the recognition of Baijiu maps [14]. To better realize the sparse decomposition of the data and reflect the weight of each component, this paper uses sparse principal component analysis (SPCA) to analyze each component of Baijiu. SPCA exploits sparsity to distinguish the principal components: the unit loading vector corresponding to each principal component contains as many zeros as possible, so that linear combinations of fewer variables can represent the original data and dimensionality reduction is better achieved [15].
In this study, SPCA was applied to reduce the dimension of Baijiu GC-MS map data, making the map data sparser in the principal component space. The main information of the GC-MS map data was screened through the cumulative contribution rate, and then several classification algorithms, namely K nearest neighbor (KNN), decision tree (DT), support vector machine (SVM), and error back propagation (BP), were used to classify the characteristic grades of Baijiu. Each algorithm was also compared with its counterpart based on PCA.
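To make the PCA step concrete, the following is a minimal sketch of how the loading vector of the first principal component could be obtained by power iteration on the sample covariance matrix. It is an illustration only: the function name `first_principal_component` and the toy data are assumptions, not part of the original study, and a numerical library would normally be used instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the unit loading vector of the first principal component.

    rows: list of samples, each a list of p variable values (hypothetical data).
    """
    n, p = len(rows), len(rows[0])
    # Center each variable (column) at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector of the largest eigenvalue, provided the
    # start vector is not orthogonal to it.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data lying exactly on the line y = 2x (illustrative values only).
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loading = first_principal_component(toy)
```

For the toy data, all variance lies along the direction (1, 2), so the returned loading vector is proportional to that direction. In ordinary PCA such loading vectors are generally dense, which is the property SPCA is meant to change.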

Sparse Principal Component Analysis Method
Sparse principal component analysis is an improved statistical analysis method that adds sparsity conditions to principal component analysis. It represents the original data using the most representative linear combinations of a few variables and converts the problem into a regression problem with a quadratic penalty, which simplifies the loadings and reduces the dimension of the data. To solve for the sparse principal components, Lasso regression can be used to transform the problem into a variable selection problem, and an elastic net penalty function, which is a linear combination of the ridge regression and Lasso penalties, can be used to obtain the sparse principal components.
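The way a combined ridge and Lasso penalty produces exact zeros can be seen in the simplest possible case, a single coefficient. The sketch below assumes the one-dimensional objective (b - beta)^2 + ridge * beta^2 + lasso * |beta|, whose minimizer is a soft-thresholded and rescaled copy of b. The function names and the numerical values are hypothetical, and this is not the full multivariate problem solved in SPCA.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_1d(b, ridge, lasso):
    """Minimizer of (b - beta)**2 + ridge * beta**2 + lasso * abs(beta)."""
    # The Lasso term sets small coefficients exactly to zero;
    # the ridge term shrinks the surviving ones by 1 / (1 + ridge).
    return soft_threshold(b, lasso / 2.0) / (1.0 + ridge)


# Hypothetical dense loadings: two large entries and two near-zero entries.
dense = [0.9, -0.4, 0.05, -0.02]
sparse = [elastic_net_1d(b, ridge=0.1, lasso=0.2) for b in dense]
```

With these illustrative penalty values the two small loadings are set exactly to zero while the two large ones are only shrunk, which is the behavior that makes the resulting components easier to interpret.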
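The steps of the SPCA algorithm are as follows, where $X$ denotes the centered data matrix:
(1) Initialize $A = [\alpha_1, \dots, \alpha_m]$ with the loading vectors corresponding to the first $m$ principal components of PCA.
(2) For the given $A$, solve the elastic net regression problem for $j = 1, \dots, m$: $\beta_j = \arg\min_{\beta} \, (\alpha_j - \beta)^T X^T X (\alpha_j - \beta) + \lambda \|\beta\|_2^2 + \lambda_{1,j} \|\beta\|_1$.
(3) For the given $B = [\beta_1, \dots, \beta_m]$, compute the singular value decomposition $X^T X B = U D V^T$ and update $A = U V^T$.
(4) Repeat steps (2) and (3) until convergence.
(5) Standardization: $\hat{V}_j = \beta_j / \|\beta_j\|_2$ for $j = 1, \dots, m$.
Calculation of the SPCA variance contribution rate: Let $\hat{Z}$ denote the extracted principal components. When the components are uncorrelated, the total variance explained by $\hat{Z}$ can be computed as $\mathrm{tr}(\hat{Z}^T \hat{Z})$. Sparse principal components may be correlated, so let $\hat{Z}_{j \cdot 1, \dots, j-1}$ be the adjusted residual of $\hat{Z}_j$ after removing the part explained by $\hat{Z}_1, \dots, \hat{Z}_{j-1}$: $\hat{Z}_{j \cdot 1, \dots, j-1} = \hat{Z}_j - H_{1, \dots, j-1} \hat{Z}_j$, where $H_{1, \dots, j-1}$ is the projection matrix onto $\hat{Z}_1, \dots, \hat{Z}_{j-1}$. Therefore, the adjusted variance of $\hat{Z}_j$ is $\|\hat{Z}_{j \cdot 1, \dots, j-1}\|^2$, and the total explained variance is $\sum_{j=1}^{m} \|\hat{Z}_{j \cdot 1, \dots, j-1}\|^2$.
This article adopts a sparse principal component algorithm based on the elastic net penalty structure, which uses the Lasso penalty within the elastic net to iteratively update the regression optimization framework and obtain the sparse principal components. Compared with PCA, SPCA adds sparsity conditions, which improves interpretability by making the loadings sparse while the variance contribution rate is still tracked. SPCA combines the sparsity conditions with an analysis of the variance and loadings of the principal components, improving on the data processing ability of PCA. Finally, because the components produced by SPCA are not guaranteed to be uncorrelated, their contributions are measured with the adjusted variance described above. Therefore, SPCA allows Baijiu data to be interpreted and statistically analyzed more effectively.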
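The adjusted variance and the resulting cumulative contribution rate can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: each component is orthogonalized against the earlier ones by Gram-Schmidt, the squared norm of the residual is taken as its adjusted variance, and the `total` variance used as the denominator is assumed to be supplied by the caller. The function names and the toy score columns are hypothetical.

```python
def adjusted_variances(scores):
    """Adjusted variance of each component score column.

    scores: list of columns, one per component, each a list over samples.
    Each column is reduced to its residual after removing the projection
    onto all earlier columns; the squared norm of that residual is returned.
    """
    basis = []   # residuals of earlier columns (mutually orthogonal)
    result = []
    for column in scores:
        residual = list(column)
        for q in basis:
            qq = sum(x * x for x in q)
            if qq > 0.0:
                coef = sum(r * x for r, x in zip(residual, q)) / qq
                residual = [r - coef * x for r, x in zip(residual, q)]
        basis.append(residual)
        result.append(sum(r * r for r in residual))
    return result


def cumulative_rates(variances, total):
    """Cumulative contribution rate after each component."""
    rates, running = [], 0.0
    for v in variances:
        running += v
        rates.append(running / total)
    return rates


# Two correlated toy score columns (illustrative values only).
toy_scores = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
adjusted = adjusted_variances(toy_scores)
rates = cumulative_rates(adjusted, total=10.0)  # total is a hypothetical value
```

In the toy example the second column has squared norm 2, but half of it is already explained by the first column, so only the residual part counts toward the explained variance. Summing unadjusted variances would overstate the contribution of correlated components.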

Experimental Methods
The Baijiu samples used in the experiment were collected from various distilleries in southern Sichuan. There were 20 samples, numbered T1, T2, T3, T4, T5, Y1, Y2, Y3, Y4, Y5, R1, R2, R3, R4, R5, S1, S2, S3, S4, and S5. The experimental instrument was a GC-6800 gas chromatography-mass spectrometer produced by Jiangsu Tianrui Instrument Co., Ltd., fitted with a DB-WAXMS chromatographic column (30 m × 0.25 mm × 0.25 µm) produced by Agilent Technologies Co., Ltd. The sensory evaluation of Baijiu was conducted with reference to the national standard GB/T 10345-2007, Analytical Methods for Baijiu, and the sensory evaluation panel was composed of wine tasters in southern Sichuan. A blind evaluation of numbered samples was adopted, in which the wine tasters comprehensively evaluated each Baijiu out of a total score of 100 points. The average of the tasters' scores is the score of the Baijiu sample. Scores from 93.0 to 100.0 are classified as special grade, 88.0 to 92.9 as first grade, 80.0 to 87.9 as second grade, 70.0 to 79.9 as excellent grade, and scores below 70.0 as others. The sensory evaluation of the 20 samples is shown in Table 1.
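The grading rule above can be written as a small function. The sketch below is an illustration under assumptions: the function names are hypothetical, the taster scores in the example are invented, and rounding the average to one decimal place is an assumption made because the grade boundaries are stated to one decimal.

```python
def grade_from_score(score):
    """Map a sensory score (0 to 100) to the grade names used in the text."""
    if score >= 93.0:
        return "special grade"
    if score >= 88.0:
        return "first grade"
    if score >= 80.0:
        return "second grade"
    if score >= 70.0:
        return "excellent grade"
    return "others"


def sample_grade(taster_scores):
    """Average the tasters' scores and grade the result."""
    mean = sum(taster_scores) / len(taster_scores)
    # Rounding to one decimal is an assumption, matching the stated boundaries.
    return grade_from_score(round(mean, 1))


# Hypothetical scores from three tasters for one sample.
example_grade = sample_grade([90.0, 92.0, 94.0])
```

These grade labels are the class labels that the classifiers are later compared against.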
Qualitative method of Baijiu flavor: measure 5 mL of Baijiu sample, add 100 μL of mixed internal standard solution (amyl acetate 15.10 g/L, tert-amyl alcohol 15.19 g/L, 2-ethylbutyric acid 15.09 g/L), mix evenly, and analyze by GC-MS. Gas chromatographic conditions: injection volume 1 μL; split ratio 20:1; injection port temperature 250 ℃; temperature program: the initial temperature of 35 ℃ is held for 10 min, then raised to 120 ℃ at 2 ℃/min, then to 200 ℃ at 5 ℃/min, and finally to 245 ℃ at 10 ℃/min, where it is held for 40 min; the carrier gas is high-purity helium (He) at a flow rate of 1 mL/min. Mass spectrometry conditions: GC-MS interface temperature 280 ℃; ion source temperature 230 ℃; mass scan range m/z 29-500; electron impact (EI) ion source; ionization energy 70 eV. Finally, after SPCA dimensionality reduction of the obtained sample data, the Baijiu samples are classified using several classification algorithms: KNN, DT, SVM, and BP.
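As an illustration of the final classification step, the sketch below shows a plain K nearest neighbor classifier operating on component scores. It is a minimal example under assumptions, not the configuration used in the experiments: the value of k, the Euclidean distance, the function name `knn_predict`, and the two-dimensional feature vectors are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbors."""
    # Sort training samples by Euclidean distance to the query.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical sparse principal component scores for two grades.
train_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
train_y = ["first grade"] * 3 + ["second grade"] * 3

pred_a = knn_predict(train_x, train_y, [0.1, 0.1])
pred_b = knn_predict(train_x, train_y, [3.0, 3.0])
```

Because KNN relies directly on distances in the feature space, its accuracy depends strongly on how well the extracted components separate the grades, which makes it a useful baseline when comparing feature extraction methods.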

Result and Analysis
The volatile flavor components of the 20 Baijiu samples were analyzed by GC-MS, and one Baijiu sample was randomly selected for detailed analysis. See Table 2 for the GC-MS analysis results of the volatile flavor components of this sample. From Table 2, 46 volatile flavor components were detected in this Baijiu sample. PCA and SPCA were used to extract features from the 46 trace components in Table 2, removing the principal components with significant data deviation and the irrelevant variables, while retaining the 10 principal components with significant eigenvalues. The results are shown in Tables 3 and 4.
From Tables 3 and 4, it can be seen that the cumulative variance contribution rates of the first 8 principal components extracted by PCA and the first 7 sparse principal components extracted by SPCA are 92.813% and 93.580%, respectively. Therefore, the first 8 principal components and the first 7 sparse principal components are selected for analysis in the subsequent experiments.
After feature extraction by PCA and SPCA, 20 Baijiu samples with known grades were classified with the common classification methods KNN, DT, SVM, and BP, and the results were compared with the grades assigned by human sensory evaluation. From the collected Baijiu data, 170 groups of Baijiu data of different grades were selected as the training set, and another 20 groups (5 groups for each grade) were randomly selected as the test set. PCA and SPCA were used to process the Baijiu test set, and the extracted principal component features were used as input data to establish a standard grade classification model that identifies the basic attributes of the test set data. The first 8 principal components and the first 7 sparse principal components were selected as the input data of the KNN, DT, SVM, and BP methods, and the classification results are shown in Figure 1.
From Figure 1, it can be seen that on the 20 test samples, the number of misclassified samples using PCA combined with KNN, DT, and SVM was 14, 4, and 1, respectively, with classification accuracies of 30%, 80%, and 95%. In contrast, the number of misclassified samples using SPCA combined with KNN, DT, and SVM was 1, 0, and 0, respectively, with classification accuracies of 95%, 100%, and 100%. It can be seen that the SPCA method extracts principal components more accurately and has stronger feature extraction ability. Both PCA+BP and SPCA+BP misclassified no samples, giving a classification accuracy of 100%, which shows that the nonlinear BP method can effectively improve the recognition accuracy. The results indicate that PCA and SPCA both achieve dimensionality reduction and information interpretation to varying degrees, and that SPCA has strong advantages over PCA in dimensionality reduction, redundancy removal, preservation of the original data information, and practical interpretability.
Using the pre-trained model, 150 Baijiu samples with known grades were randomly selected from the collected Baijiu data outside the training set, and their GC-MS maps were measured. Features were extracted from the test results by SPCA, and the extracted sparse principal components were used as input data and classified by KNN, DT, SVM, and BP. The classification results of each method are shown in Figure 2.
It can be seen from Figure 2 that 11 of the 150 Baijiu samples do not belong to the special, first, second, or excellent grades. The number of misclassified samples for SPCA combined with KNN, DT, SVM, and BP was 57, 17, 5, and 0, respectively, with classification accuracies of 62%, 89%, 97%, and 100%. The classification verification shows that the SPCA method can effectively extract the main characteristics and information of different grades of Baijiu and that, combined with the nonlinear BP classification method, its accuracy for different grades of Baijiu can reach 100%.

Conclusion
In this study, GC-MS was used to analyze the volatile flavor components of Baijiu samples of four different grades, and features were extracted by PCA and SPCA. The results showed that, compared with the ordinary principal components, the sparse principal components of the GC-MS maps differed more clearly between grades and could better represent the information of Baijiu with different characteristics. Subsequently, the Baijiu samples were classified based on PCA and SPCA, each in combination with KNN, DT, SVM, and BP. The results showed that the classification accuracy based on SPCA was higher, and the accuracies of SPCA+KNN, SPCA+DT, SPCA+SVM, and SPCA+BP reached 62%, 89%, 97%, and 100%, respectively. The classification effect of SPCA combined with BP was the best, indicating that the relationship among the sparse principal components of the Baijiu GC-MS data is nonlinear. The experimental results were verified against the sample data, and the classification results were completely consistent with the data information, indicating that the established Baijiu quality evaluation model based on sparse principal component analysis can effectively evaluate Baijiu grades, providing a more effective and objective method for Baijiu quality control and grade identification.

Figure 1: Classification results of PCA and SPCA combined with four classification methods. Panels: (a) scatter plot of PCA+KNN classification results; (b) SPCA+KNN; (c) PCA+DT; (d) SPCA+DT; (e) PCA+SVM; (f) SPCA+SVM; (g) PCA+BP; (h) SPCA+BP.

Figure 2: Classification results of four classification methods. Panels: (a) scatter plot of SPCA+KNN classification results; (b) SPCA+DT; (c) SPCA+SVM; (d) SPCA+BP.

Table 1: Results of sensory evaluation of Baijiu samples

Table 2: 46 kinds of trace components and their contents in Baijiu base liquor samples

Table 3: The eigenvalues, contribution rates and cumulative variances of the 10 principal components

Table 4: The eigenvalues, contribution rates and cumulative variances of the 10 sparse principal components
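
The classification accuracies quoted in the text follow directly from the reported counts of wrongly classified samples (referred to in the text as false positives) and the size of each test set. The short check below reproduces that arithmetic. The function name `accuracy_percent` is hypothetical, and the counts are copied from the text rather than recomputed from data.

```python
def accuracy_percent(misclassified, total):
    """Percentage of correctly classified samples."""
    return 100.0 * (total - misclassified) / total


# Counts of wrongly classified samples as reported for the 20-sample test set.
pca_20 = {"KNN": 14, "DT": 4, "SVM": 1}
spca_20 = {"KNN": 1, "DT": 0, "SVM": 0}
# Counts as reported for the 150-sample verification set (SPCA only).
spca_150 = {"KNN": 57, "DT": 17, "SVM": 5, "BP": 0}

for name, counts, total in [("PCA, n=20", pca_20, 20),
                            ("SPCA, n=20", spca_20, 20),
                            ("SPCA, n=150", spca_150, 150)]:
    for method, wrong in counts.items():
        print(f"{name} + {method}: {accuracy_percent(wrong, total):.1f}%")
```

For the 150-sample set, the DT and SVM values are not whole numbers, so the quoted 89% and 97% correspond to rounding to the nearest percent.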