Research on Stock Price Prediction Based on Orthogonal Gaussian Basis Function Expansion and Pearson Correlation Coefficient Weighted LSTM Neural Network

Abstract: For stock price prediction in quantitative finance, deep learning techniques such as the LSTM neural network do not require the stationarity assumption of traditional time series models (such as ARIMA and GARCH) and can forecast medium- and long-term time series, so they have attracted much attention. This paper proposes an improved LSTM neural network based on orthogonal Gaussian basis function expansion and Pearson correlation coefficient weighting. The proposed method uses the functional features of intraday prices to fit the residual series of the LSTM neural network predictions. Because the underlying model structure between each component of the functional feature vector and the residual series is unknown, we use the Bagging method to trade off the variance and bias of the prediction model. In addition, since the dimension of the predictor variable of the LSTM neural network is a parameter to be estimated, we tune it with a model averaging method based on Pearson correlation coefficient weighting. The results of real data analysis show that the proposed method significantly improves the prediction accuracy of the original LSTM neural network and is reasonably robust. Finally, the proposed method can be further applied to consumer price index (CPI) prediction, daily average temperature prediction, and real-time monitoring of environmental trace elements.


Introduction
As China's modern market economy develops, people's awareness of financial management is maturing, and finance, as the core of the modern economy, has gradually become an area of wide public attention. The stock market, as a barometer of the national economy, finance and investment, most directly reflects the trend of the economy through its fluctuations and affects the nerves of the entire market; it is therefore one of the key research directions of many experts and scholars. Stock prices are volatile and their influencing factors are numerous and complex; at the same time, a stock price series is a time series with high noise, high dimensionality and information that is difficult to capture. Therefore, how to accurately reveal the trend of a stock price series and forecast the stock reasonably is a problem of concern to both academia and industry.
For stock price series prediction, there are many classical time series prediction models. The autoregressive integrated moving average (ARIMA) model and the generalized autoregressive conditional heteroscedasticity (GARCH) model have been widely used. For example, a combination of ARIMA and MLPs (multi-layer perceptrons) has been used to forecast the S&P 500 index, the Shenzhen Component index and the Dow Jones index (Rahimi, Z.H.Etter., 2018) [1], and the accuracy of improved GARCH family models in stock market prediction has been studied (Wanrui, 2022) [2]. However, when these classical models are used for time series prediction, the series must be stationary or be made stationary by differencing; otherwise, the models cannot capture its regularities. In addition, these models can only capture linear relationships, not nonlinear or functional relationships [3][4]. With the development of machine learning methods, the ARIMA model has been combined with kernel principal component analysis (KPCA) to achieve nonlinear dimensionality reduction, which mines the nonlinear information contained in the data set and improves prediction accuracy. However, the data structure between the reduced variables and the response remains unknown (Zheng Hong et al., 2020) [5]. Compared with traditional prediction methods, deep learning methods such as the LSTM (long short-term memory) neural network (Bao Yueyan, 2021) [6] and the BP neural network (Zeng Lifang et al., 2020) [7] are better at handling time series data. When they are used to predict stock price series, they need no stationarity assumption, need not consider the curse of dimensionality, and can capture nonlinear information through nonlinear activation functions. However, they cannot capture functional information, and the dimension of the predictor variable is not easy to choose (Gui-jun Yang et al., 2022) [8]. Therefore, both the traditional classical models and the neural network models have their own advantages and limitations.
The traditional classical models can explore the implicit linear relationships in the data well, while neural network models have their own merits in dealing with nonlinear relationships and high-dimensional problems. Stock price prediction therefore needs to combine the advantages of the two types of models in a combined model. The common idea in constructing a combined model is to decompose the data, fit the linear and nonlinear parts with a statistical model and a neural network model, respectively, and then superimpose the two to obtain the prediction. For example, Zhang (2003) [9] discusses a combined model of ARIMA and neural networks. In the first stage, ARIMA captures the linear trend in the time series data; then, based on the output of the first stage, an ANN is used to capture the nonlinear relationship in the residual series. Based on this idea, a large number of scholars have constructed combined models to predict financial time series and confirmed the advantages of combined models over single models (Anna et al., 2021) [10].
In summary, this paper proposes a combined prediction model based on orthogonal Gaussian basis function expansion and a Pearson correlation coefficient weighted LSTM neural network. The proposed method has the following advantages. First, it does not need the stationarity assumption, and it not only extracts linear and nonlinear information but also adds functional auxiliary information to the original time series prediction by using the intraday price and the basis expansion method. Second, the latent model structure between the feature components and the residual series is determined by the Bagging method. Third, it provides a parameter selection method based on model averaging for the LSTM neural network to select the appropriate dimension of the predictor variables. Experimental results show that the proposed method has higher accuracy and robustness than the original LSTM.

Gaussian basis function expansion
Assume there is a single independent variable, and consider the functionalization of its data without considering the dependence within the variable. Assume that the observations of each subject are generated by a regression model in which the residuals follow independent normal distributions, $a_{ik}$ denotes the coefficients, and $u_k(t)$ is a set of orthogonal basis functions, each of which forms a local receptive field in the input space. The Gaussian basis function is specified by $\mu_k$, the location of its center, $\eta_k^2$, its dispersion parameter, and $\nu$, a hyperparameter. A clustering algorithm is first used to determine the centers and dispersion parameters of the Gaussian basis functions, and an orthonormal Gaussian basis is then constructed from them.
After the basis functions are orthogonalized, the coefficients are estimated by regularization, that is, by maximizing the penalized log-likelihood function, which gives the maximum penalized likelihood estimator. In practice, the generalized information criterion (GIC) is used for each curve to obtain the optimal number of basis functions.
Finally, the matrix of estimated coefficients $\hat{a}_{ik}$ is obtained.
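The expansion described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the exact procedure of this paper: the Gaussian basis is assumed to take the commonly used form $\exp\{-(t-\mu_k)^2/(2\nu\eta_k^2)\}$, the centers and dispersions come from a plain one-dimensional k-means, the basis is orthonormalized by a QR decomposition on the observation grid, and the penalized likelihood is taken to be a ridge-type criterion with a hypothetical penalty `lam`. The GIC-based choice of the number of basis functions is omitted, and all function names are hypothetical.

```python
import numpy as np

def kmeans_1d(t, m, n_iter=50):
    """Plain 1-D k-means: returns cluster centers and within-cluster variances."""
    centers = np.quantile(t, (np.arange(m) + 0.5) / m)  # deterministic start
    for _ in range(n_iter):
        labels = np.argmin(np.abs(t[:, None] - centers[None, :]), axis=1)
        for k in range(m):
            if np.any(labels == k):
                centers[k] = t[labels == k].mean()
    labels = np.argmin(np.abs(t[:, None] - centers[None, :]), axis=1)
    var = np.full(m, 1e-6)
    for k in range(m):
        if np.sum(labels == k) > 1:
            var[k] = max(t[labels == k].var(), 1e-6)
    return centers, var

def gaussian_design(t, centers, var, nu):
    """Design matrix of Gaussian basis functions (assumed form, see lead-in)."""
    return np.exp(-(t[:, None] - centers[None, :]) ** 2 / (2.0 * nu * var[None, :]))

def orthonormal_basis(Phi):
    """Orthonormalize the basis columns on the grid via reduced QR."""
    Q, _ = np.linalg.qr(Phi)
    return Q

def fit_coefficients(X, U, lam):
    """Ridge-type penalized least squares; one row of coefficients per curve."""
    m = U.shape[1]
    return np.linalg.solve(U.T @ U + lam * np.eye(m), U.T @ X.T).T

# Synthetic stand-in for three days of intraday prices on a 240-point grid.
t = np.linspace(0.0, 1.0, 240)
rng = np.random.default_rng(0)
X = np.sin(2 * np.pi * t)[None, :] + 0.05 * rng.standard_normal((3, 240))

centers, var = kmeans_1d(t, m=8)
U = orthonormal_basis(gaussian_design(t, centers, var, nu=5.0))
A = fit_coefficients(X, U, lam=1e-3)  # coefficient matrix, shape (3, 8)
```

Each row of `A` plays the role of the functional feature vector of one curve, which is what the later residual-fitting step consumes.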

Improved LSTM neural network based on Pearson correlation coefficient weighting
In this paper, the intraday price is treated as auxiliary information and regarded as a function, and the orthogonal Gaussian basis function expansion is used to extract the feature information of this function. Furthermore, since the underlying model structure between each component of the feature vector and the residual sequence is unknown, we use the Bagging method to capture the feature information. The Bagging ensemble algorithm is a technique that reduces the generalization error by combining multiple base models. In addition, LSTM is a recurrent neural network obtained by improving the RNN. It effectively overcomes the vanishing gradient problem of the RNN and outperforms the RNN, especially in tasks with long-range dependence. However, since the selection of the prediction period of the LSTM neural network is a problem to be solved, the model averaging method based on the Pearson correlation coefficient is used in this paper to measure the prediction accuracy and evaluate the importance of each model. In this paper, this combined prediction method is called the LSTM neural network based on orthogonal Gaussian basis function expansion and Pearson correlation coefficient weighting (GBM-LSTM). The specific algorithm is as follows:
Step 1: Obtain the stock opening price and its corresponding intraday price data.
Step 2: Set multiple forecast periods, and apply the LSTM neural network for each forecast period to obtain the set of LSTM predicted value sequences.
Step 3: Take the difference between the true opening price and the predicted values of each LSTM neural network to obtain multiple residual sequences.
Step 4: Use Gaussian basis expansion and Bagging regression to predict each residual series, obtain the predicted residuals of each LSTM neural network, and add them to the original predicted values.
Step 5: Compute the Pearson correlation coefficient between each predicted value sequence and the true value sequence on the training set, and normalize the coefficients to obtain the weight vector of the set of LSTM neural networks.
Step 6: Take the inner product of the vector of new predicted values from the multiple LSTM neural networks with the weight vector to obtain the final predicted value.
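Steps 3 to 6 can be sketched as follows. This is a minimal illustration under assumptions that the text does not fix: the base learner of the Bagging regression is taken to be ordinary least squares, the Pearson coefficients are normalized by dividing by their sum (which presumes positive correlations), the weights are computed from the corrected predictions, and the LSTM outputs and functional features are replaced by synthetic stand-ins. All function names are hypothetical.

```python
import numpy as np

def bagging_predict(F_train, r_train, F_test, n_estimators=50, seed=0):
    """Bagging: average least-squares fits over bootstrap resamples.
    The least-squares base learner is an assumption of this sketch."""
    rng = np.random.default_rng(seed)
    n = F_train.shape[0]
    Z_train = np.column_stack([np.ones(n), F_train])
    Z_test = np.column_stack([np.ones(F_test.shape[0]), F_test])
    preds = np.zeros(F_test.shape[0])
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # bootstrap sample with replacement
        beta, *_ = np.linalg.lstsq(Z_train[idx], r_train[idx], rcond=None)
        preds += Z_test @ beta
    return preds / n_estimators

def pearson_weights(preds_train, y_train):
    """Pearson correlation of each model with the truth, normalized to sum to one."""
    r = np.array([np.corrcoef(p, y_train)[0, 1] for p in preds_train])
    return r / r.sum()

# Synthetic stand-ins: F mimics basis coefficients, lstm_preds mimics LSTM outputs
# for three forecast periods whose errors depend on the functional features.
rng = np.random.default_rng(1)
n_train, n_test, m = 150, 30, 4
F = rng.standard_normal((n_train + n_test, m))
y = 10.0 + np.cumsum(0.1 * rng.standard_normal(n_train + n_test))
lstm_preds = np.stack([y - (F @ rng.standard_normal(m)) * s for s in (0.2, 0.4, 0.6)])

corrected = []
for p in lstm_preds:
    resid = y - p                                               # Step 3
    r_hat = bagging_predict(F[:n_train], resid[:n_train], F)    # Step 4
    corrected.append(p + r_hat)
corrected = np.stack(corrected)

w = pearson_weights(corrected[:, :n_train], y[:n_train])        # Step 5
final = w @ corrected[:, n_train:]                              # Step 6
```

In this toy setting the residuals are exactly linear in the features, so the correction is nearly perfect; with real data the gain depends on how much of the LSTM error the intraday features explain.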

Data sources and pre-analysis
The data in this paper come from the Wind database (https://www.wind.com.cn/). The daily prices of three stocks on the Chinese A-share market, namely Trendy Energy (SH600777), Oriental Group (SH600811) and Opai Household (SH603833), are selected as samples. They belong to the oil and gas extraction industry, the agricultural and sideline food processing industry, and the custom furniture industry, respectively. These three stocks are selected because they differ in stationarity and complexity, which allows us to test whether the stock price prediction model based on orthogonal Gaussian basis function expansion and the Pearson correlation coefficient weighted LSTM neural network yields a similar improvement for stocks of different stationarity and complexity. The time series of the three stocks are shown in Figure 1. From the time series plots, we can preliminarily judge that all three original series are non-stationary. The ADF test statistics of the three stocks are -2.723, -1.252 and -2.395, respectively, all of which are smaller in absolute value than the critical value at the 1% significance level. Therefore, the null hypothesis is not rejected and a unit root is present. Therefore, at the significance level of 0.05, the data of the three stocks can be regarded as non-stationary time series.
Figure 1: The opening price trend of stocks SH600777, SH600811 and SH603833
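The decision rule of the unit root test can be sketched as follows. This is a simplified illustration, not the test actually run in this paper: it implements the plain Dickey-Fuller regression with a constant and no lagged differences, and the critical values are the approximate asymptotic values for the constant-only case, which is an assumption about the test variant. The function names are hypothetical.

```python
import numpy as np

# Approximate asymptotic critical values, constant-only case (assumed variant).
ASYMPTOTIC_CRITICAL = {"1%": -3.43, "5%": -2.86, "10%": -2.57}

def dickey_fuller_stat(y):
    """t-statistic of gamma in  dy_t = alpha + gamma * y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    Z = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(Z, dy, rcond=None)
    resid = dy - Z @ beta
    sigma2 = resid @ resid / (len(dy) - 2)
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return float(beta[1] / np.sqrt(cov[1, 1]))

def fail_to_reject_unit_root(stat, level="1%"):
    """True when the statistic is not below the critical value (series treated as non-stationary)."""
    return stat > ASYMPTOTIC_CRITICAL[level]

# The three statistics reported in the text.
reported = (-2.723, -1.252, -2.395)
decisions = [fail_to_reject_unit_root(s, "1%") for s in reported]

# White noise is strongly stationary, so its statistic is far below the critical value.
noise = np.random.default_rng(0).standard_normal(242)
noise_stat = dickey_fuller_stat(noise)
```

Under these assumed critical values, all three reported statistics lie above the 1% threshold, which matches the conclusion that a unit root is not rejected.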

Comparison Results
The experiment is implemented in the TensorFlow framework. Each of the three stocks has 242 samples, and the intraday price frequency is 240. To test robustness, we select different training periods: t = [150,155,160,165,170,175,180,185,190,195,200,205,210,215,220
In the evaluation indexes, $e_k$ is the residual sequence, $Y_k$ is the true value, $S_1$ is the standard deviation of the original sequence, and $S_2$ is the standard deviation of the relative value sequence. The smaller the three indicators of a model, the higher its prediction accuracy.
Table 1, Table 2 and Table 3 compare the evaluation indexes for the three stocks, and Table 4 shows the difference test of the evaluation indexes. The first three columns of Table 1, Table 2 and Table 3 show the errors of the improved model, while the last three columns show the errors of the original model. For stock SH600777, the values of all three evaluation indexes are clearly smaller for the improved model than for the original model. In Table 4, t denotes the test statistic, namely the t-statistic, and P denotes the p-value corresponding to the t-statistic. The tests for all three stocks reject the null hypothesis in favor of the alternative hypothesis, that is, there is a significant difference between the two models, and the improved model is significantly better than the original model.
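The kind of comparison reported in Tables 1 to 4 can be sketched as follows. The exact definitions of the three evaluation indexes are not reproduced here, so the metrics below (root mean squared error, mean absolute relative error, and the ratio of the residual standard deviation to the standard deviation of the original sequence) are assumptions chosen to be consistent with the symbols $e_k$, $Y_k$, $S_1$ and $S_2$. The difference test is assumed to be a paired t-test across training periods, and the error values are made-up placeholders, not results from this paper.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error of the residuals e_k = Y_k - prediction."""
    e = np.asarray(y, float) - np.asarray(y_hat, float)
    return float(np.sqrt(np.mean(e ** 2)))

def mean_relative_error(y, y_hat):
    """Mean of |e_k / Y_k| (assumed form of a relative-error index)."""
    y = np.asarray(y, float)
    return float(np.mean(np.abs((y - np.asarray(y_hat, float)) / y)))

def std_ratio(y, y_hat):
    """S_2 / S_1 with S_2 taken as the residual standard deviation (assumption)."""
    y = np.asarray(y, float)
    e = y - np.asarray(y_hat, float)
    return float(np.std(e, ddof=1) / np.std(y, ddof=1))

def paired_t(a, b):
    """Paired t-statistic for the mean difference between two error sequences."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(len(d))))

# Placeholder data for illustration only.
y = np.array([10.0, 10.2, 10.1, 10.4])
improved = y + np.array([0.01, -0.02, 0.01, 0.00])
original = y + np.array([0.10, -0.20, 0.15, -0.10])

# Placeholder per-training-period errors of the two models.
improved_err = [0.10, 0.12, 0.11, 0.09]
original_err = [0.20, 0.25, 0.21, 0.22]
t_stat = paired_t(improved_err, original_err)
```

A negative `t_stat` indicates that the improved model has the smaller error on average; the p-value would then be read from a t distribution with one fewer degree of freedom than the number of training periods.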
There are three reasons for this result. First, the improved model does not need the stationarity assumption; it not only extracts linear and nonlinear information but also, by using the intraday price and the basis expansion method, adds functional auxiliary information to the original time series prediction. The original model, in contrast, analyzes only the raw data, and because the stock price data are non-stationary, this leads to larger errors. Second, the underlying model structure between the feature components and the residual series of the improved model can be determined by the Bagging method, which greatly reduces the error in the model structure through the idea of ensembling. Third, the improved model provides a parameter selection method based on model averaging for the LSTM neural network to select the appropriate dimension of the predictor variables, which increases the accuracy of the model. In conclusion, we believe that the stock price prediction model based on the improved LSTM neural network has better forecasting ability than the original model. The prediction results for the other two stocks are almost the same as those for the first stock (SH600777): the improved model again predicts better than the original model.

Conclusion and Discussion
In this paper, the LSTM neural network based on orthogonal Gaussian basis function expansion and Pearson correlation coefficient weighting is used to model and predict three stock prices. Compared with the traditional model, it has significant advantages. First, it does not need to handle data stationarity as the classical models do. Furthermore, it not only extracts linear and nonlinear information but also adds functional auxiliary information to the original time series prediction through the intraday price and the basis expansion method. Second, the latent model structure between the feature components and the residual series is determined by the Bagging method. Finally, it provides a parameter selection method based on model averaging for the LSTM neural network to select the appropriate dimension of the predictor variables. Therefore, although stock prices are characterized by high noise, high dimensionality and information that is difficult to capture, the stock price prediction model based on orthogonal Gaussian basis function expansion and the Pearson correlation coefficient weighted LSTM neural network can predict future stock prices from historical data with relatively high accuracy. We believe that this prediction method leaves several questions worth discussing in future work: how it performs when applied to other fields; how it performs on high-frequency time series data; and whether different methods of extracting functional information affect the results.