Analysis of the Price Influence Factors of Used Audi Cars Based on Ridge Regression Model

Abstract: This paper uses a ridge regression model to explore the factors affecting the price of second-hand Audi cars. Feature data for a large number of used Audi cars were collected, including Model, Year, Mileage and other characteristics, together with the corresponding prices. Because these factors tend to vary together, most of the data exhibit multicollinearity. If ordinary least squares (OLS) is used to estimate the model parameters, the resulting estimates may not reflect the actual situation objectively and accurately [6]. Ridge regression addresses the multicollinearity problem by introducing a regularization term, and is therefore used here for modeling and prediction. When building the model, the correlation between features was taken into account and an appropriate regularization parameter was chosen. The experimental results show that the regression coefficients of Year, Mileage and Tax are 5.17296619, -0.60579774 and 1.46868943 respectively, indicating that age, mileage and tax are important factors affecting the price of second-hand Audi cars [3]. This study provides a reliable method for analyzing the price factors in the used Audi car market, which has reference value for both buyers and sellers.


Introduction
Linear regression is one of the earliest and simplest regression models; it establishes a linear relationship between the independent and dependent variables. The optimal linear fit is obtained by minimizing the sum of squared residuals.
The general form of the linear regression model can be expressed as follows:

$$y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i \quad (1)$$

In the linear regression model, we estimate the model parameters by minimizing the residual sum of squares:

$$\min_{\beta} \sum_{i} \left( y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}) \right)^2 \quad (2)$$

However, when there is multicollinearity between the independent variables, the least-squares estimates may be unstable and have a large variance. To solve this problem, we introduce a regularization term that limits the size of the model parameters, thus improving the stability and generalization ability of the model.
Ridge regression introduces the L2 norm as the regularization term, and its objective function is [4]:

$$\sum_{i} \left( y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}) \right)^2 + \alpha \sum_{j} \beta_j^2 \quad (3)$$

where α is the regularization parameter that controls the degree of regularization. By adjusting the value of α, we can balance the trade-off between the residual sum of squares and the regularization term.
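As an illustration of how objectives (2) and (3) differ, the following minimal sketch evaluates both on a small made-up dataset. The function names and the toy data are hypothetical, and the sketch assumes the intercept β0 is not penalized, which the text does not state explicitly.

```python
import numpy as np

def rss(beta0, beta, X, y):
    """Residual sum of squares, as in Eq. (2)."""
    residuals = y - (beta0 + X @ beta)
    return float(np.sum(residuals ** 2))

def ridge_objective(beta0, beta, X, y, alpha):
    """Eq. (3): RSS plus alpha times the sum of squared slope coefficients.

    Assumption of this sketch: the intercept beta0 is left unpenalized.
    """
    return rss(beta0, beta, X, y) + alpha * float(np.sum(beta ** 2))

# Toy data (made up for illustration): y = 1 + 2*x1 + 3*x2 with no noise
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
beta_true = np.array([2.0, 3.0])
y = 1.0 + X @ beta_true

# With alpha = 0 the ridge objective reduces to the plain RSS (zero here).
fit_only = ridge_objective(1.0, beta_true, X, y, alpha=0.0)
# With alpha = 0.5 the same coefficients pay a penalty of 0.5 * (2^2 + 3^2).
with_penalty = ridge_objective(1.0, beta_true, X, y, alpha=0.5)
```

Larger values of α make large coefficients more expensive, so the minimizer of (3) is pulled toward smaller coefficients than the minimizer of (2).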

Data collection
The dataset comes from the Kaggle website (https://www.kaggle.com/code/smailaar/auidi-vehiclespredict-regression/input) and contains information on used Audi cars that is used to predict used car prices. It covers multiple features: Model, Year, Price, Transmission, Mileage, FuelType, Tax, Mpg, and EngineSize. Each feature has its own meaning and data type: Model, Transmission, and FuelType are categorical variables, while Year, Price, Mileage, Tax, Mpg, and EngineSize are numerical variables.
The categorical variables are shown in Table 1.

Data visualization
The number of used Audi cars of each model was obtained through data processing and analysis, and the results are shown in Figure 1. The analysis of yearly supply by gearbox type from 2015 to 2018 shows that, before 2018, automatic-transmission cars had the smallest supply, whereas in 2018 supply was at its largest and manual-transmission cars had the smallest supply. The results are shown in Figure 3. Analysis of the price distribution of the different second-hand Audi models shows that the R8 has the widest price distribution, while the A2 and RS7 have the narrowest, with prices at basically the same level. The results are shown in Figure 5.

Processing of the data
First, the collected data were processed, and the independent variables (Model, Year, Transmission, Mileage, FuelType, Tax, Mpg) were used to fit the response variable (Price). Because the collinearity among the variables is strong, ordinary least squares (OLS) regression is not suitable, and a ridge regression model is used instead.

Standardized processing of the data
Data standardization, also known as data normalization or feature scaling, is a commonly used data preprocessing technique that transforms different data according to fixed rules so that they have a similar scale, range or distribution.
In this paper, we transform six numerical variables (Year, Price, Mileage, Tax, Mpg, EngineSize) to a unified scale. We apply Min-max normalization to the raw data so that each numerical variable is mapped to the range 0 to 1. Each data point of each feature is normalized using the following formula:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where x is the raw data, x' is the normalized data, min(x) is the minimum value of the original data, and max(x) is the maximum value of the original data.
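The Min-max formula above can be sketched in a few lines. The function name and the mileage values are made up for illustration, and the handling of a constant column (mapping it to zeros) is an assumption of this sketch rather than something the text specifies.

```python
import numpy as np

def min_max_scale(x):
    """Map a 1-D sequence of numbers to [0, 1] with x' = (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # Assumption: a constant feature carries no information, so return zeros
        # instead of dividing by zero.
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

# Hypothetical mileage values, not taken from the dataset
mileage = [5000, 20000, 35000, 80000]
scaled = min_max_scale(mileage)  # smallest value maps to 0, largest to 1
```

In practice the minimum and maximum would usually be computed on the training data and reused for the test data, so that both are mapped by the same transformation.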

One-hot encoding
One-hot encoding, also known as one-bit effective encoding, uses an N-bit state register to encode N states [5]. It maps each value of a categorical variable to a new feature vector consisting only of 0s and 1s, which represents the different categories of the variable. The principle of one-hot encoding is to convert each category into a unique binary code.
In this paper, one-hot encoding is applied to three categorical features: Model, Transmission, and FuelType. A unique code is established for every distinct value of each feature, with 0 and 1 indicating whether a sample takes that value. For example, for the Model feature, if there are N different models, then N binary features are created to represent the presence or absence of each model. One-hot encoding retains the information in categorical features and partly avoids the influence of an artificial ordering between different values on the model.
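A minimal sketch of the encoding described above, written in plain Python. The function name and the example category labels are illustrative assumptions; the column order (alphabetical) is a choice of this sketch.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct category.

    Columns are ordered alphabetically by category label.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is "hot" per sample
        rows.append(row)
    return categories, rows

# Illustrative labels for the Transmission feature
transmission = ["Manual", "Automatic", "Semi-Auto", "Manual"]
categories, encoded = one_hot_encode(transmission)
```

With N columns per feature, the columns of each encoded feature always sum to one, so they are exactly collinear with an intercept term. This is one more source of collinearity that a regularized model tolerates better than OLS.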

The principle of ridge regression
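The ridge regression estimate of the coefficients is [2]:

$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y \quad (5)$$

where λ is the ridge coefficient and I is the identity matrix (1 on the diagonal, 0 elsewhere). The identity matrix has full rank, and λI still has full rank for λ > 0. Because X^T X is positive semidefinite, X^T X + λI is invertible for any λ > 0, even when multicollinearity makes X^T X singular or nearly singular.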
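To make Eq. (5) concrete, the sketch below applies it to synthetic data in which one column is an exact copy of another, so that X^T X is singular and the OLS normal equations have no unique solution. The function name and data are hypothetical, and the sketch omits the intercept for brevity.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y, i.e. Eq. (5), without forming an inverse."""
    n_features = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

# Synthetic data (assumption of this sketch): the second column duplicates the first
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1])
y = 3.0 * x1 + rng.normal(scale=0.1, size=50)

# X^T X has rank 1 here, so it cannot be inverted on its own.
rank_gram = np.linalg.matrix_rank(X.T @ X)

# Adding lam * I makes the system solvable; the total effect of about 3
# is split evenly between the two identical columns.
beta = ridge_closed_form(X, y, lam=1.0)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a numerical convenience of the sketch; it computes the same estimate as Eq. (5).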
Cost function of the ridge regression: the regularized cost function adds an L2 penalty, λ times the sum of the squared weights θ_j², to the least-squares loss. This added term is called the L2 regularization term. The penalty coefficient λ is a key parameter for regulating the quality of the model, and we illustrate how it regulates model complexity through two extreme cases [4].
λ = 0: the loss function is the same as the original loss function (the least-squares form), meaning that there is no penalty on the parameter weights θ.
λ → ∞: when the penalty coefficient λ is infinitely large, the structural risk function can only be minimized by driving all weight coefficients θ toward zero. The penalty thus reduces the parameter weights, and reducing the weights reduces the complexity of the model.
Overfitting is caused by excessive model complexity.
Ridge regression was first used to handle cases where there are more features than samples, and is now also used to introduce bias into the estimates in exchange for better estimates. It can also address multicollinearity; ridge regression is a biased estimator.
Here we limit the sum of the squares of all weights w by introducing λ. By introducing the penalty term, we can shrink the unimportant parameters, a technique called shrinkage in statistics.
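The two extreme cases and the shrinkage effect can be checked numerically. The following sketch uses synthetic data and a hypothetical helper, with no intercept; it compares the ridge solution at λ = 0 with the OLS solution and tracks the norm of the coefficient vector as λ grows.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^{-1} X^T y, computed with a linear solve."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Synthetic data (assumption of this sketch), three well-conditioned features
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Case lambda = 0: identical to ordinary least squares
ols = np.linalg.lstsq(X, y, rcond=None)[0]
ridge_at_zero = ridge_closed_form(X, y, 0.0)

# Growing lambda shrinks the coefficient vector toward zero
lambdas = (0.0, 1.0, 100.0, 1e6)
norms = [float(np.linalg.norm(ridge_closed_form(X, y, lam))) for lam in lambdas]
```

The norms decrease steadily as λ increases and become negligible for very large λ, which matches the two limiting cases described above.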

Model training
Ridge regression is an extension of linear regression for handling data with collinearity. We control the complexity of the model by introducing a regularization term, which reduces the variance of the parameter estimates.
The ridge regression model training steps are as follows: (1) The standardized numerical variables and the one-hot encoded categorical variables were combined into a new dataset.
(2) Once the dataset was assembled, it was divided into a training set and a test set for training and evaluating the model, in a ratio of 70% training set to 30% test set.
(3) Ridge regression models were trained using the training set data. During training, the model optimizes its parameters by minimizing the loss function. The ridge regression model adds an L2 regularization term to control the sum of squares of the model parameters.
(4) During training, methods such as cross-validation were used to select the optimal regularization parameter. By trying different parameter values and evaluating model performance, we selected the value that makes the model perform best on the training and test sets.
(5) The trained model was used to predict the test set, and its performance on the test set was evaluated. The evaluation indicators used were the MSE and R². The results of the evaluation are shown in Table 3.
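The training steps above can be sketched end to end with NumPy. This is only an illustration under stated assumptions: the data are synthetic stand-ins for the prepared feature matrix, all function names are hypothetical, the candidate α grid is arbitrary, and the regularization parameter is tuned by 5-fold cross-validation on the training set only, which is a choice of this sketch to keep the test set untouched until the final evaluation.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Ridge fit with an unpenalized intercept, obtained by centering X and y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ coef, coef

def predict(model, X):
    intercept, coef = model
    return intercept + X @ coef

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def cv_mse(X, y, alpha, k=5):
    """Mean validation MSE over k folds for one value of alpha."""
    folds = np.array_split(np.arange(len(y)), k)
    scores = []
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_ridge(X[train], y[train], alpha)
        scores.append(mse(y[folds[i]], predict(model, X[folds[i]])))
    return float(np.mean(scores))

# Step 1 (stand-in): synthetic features in [0, 1]; column 3 nearly copies column 0
rng = np.random.default_rng(42)
n = 200
X = rng.uniform(size=(n, 4))
X[:, 3] = X[:, 0] + rng.normal(scale=0.01, size=n)
y = X @ np.array([1.0, -0.5, 0.3, 1.0]) + rng.normal(scale=0.05, size=n)

# Step 2: shuffled 70% / 30% split
perm = rng.permutation(n)
n_train = int(0.7 * n)
train_idx, test_idx = perm[:n_train], perm[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Step 4: choose alpha by cross-validation on the training set
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
best_alpha = min(alphas, key=lambda a: cv_mse(X_train, y_train, a))

# Steps 3 and 5: fit on the training set, evaluate on the held-out test set
model = fit_ridge(X_train, y_train, best_alpha)
test_mse = mse(y_test, predict(model, X_test))
```

A library implementation of ridge regression with built-in cross-validation would serve the same purpose; the explicit version is shown only to make each step visible.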

Model evaluation
MRSE is an indicator of the model's prediction error; the smaller the value, the more accurately the model predicts the target variable. The MRSE values on the training and test data were 0.061193 and 0.061109, respectively, indicating that the average error of the model is small when predicting used car prices.
R² measures the proportion of the variability of the target variable that the model explains. The values typically range between 0 and 1; the closer to 1, the better the model explains the target variable. The R² values on the training and test data were 0.958 and 0.959, respectively, indicating that the model explains about 96% of the price variability in both the training and test data.
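For reference, the error and R² measures can be computed as in the sketch below. The function names and the four example values are made up for illustration; the sketch shows the mean squared error together with its square root, since both are common ways to report prediction error.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mse(y_true, y_pred)))

def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up normalized prices and predictions; every prediction is off by 0.05
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.25, 0.35, 0.65, 0.75])

error_mse = mse(y_true, y_pred)    # 0.0025
error_rmse = rmse(y_true, y_pred)  # 0.05
r2 = r2_score(y_true, y_pred)      # 1 - 0.01 / 0.2 = 0.95
```

On held-out data R² can fall below 0 when the model predicts worse than the mean of the target, so values close to 1 on both training and test data indicate a fit that generalizes.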

Regression coefficient
The regression coefficients for the different variables were obtained by model fitting, and the three features contributing the most to the prediction results are shown in Table 4.

Comparison of the regression coefficient
Based on the regression coefficients provided by the model, the following conclusions can be drawn: (1) The production year, mileage and tax of used Audi cars contributed the most to the prediction results, with regression coefficients of 5.17296619, -0.60579774 and 1.46868943, respectively. This suggests that age, mileage, and tax have substantial effects on used car prices: the newer the car, the higher the price, and the lower the mileage, the higher the price.
(2) The regression coefficients of the other features are close to zero, indicating that these features also influence the prediction results, but only to a relatively small degree.

Suggestions for second-hand Audi car sellers
(1) The year and mileage of a used car should be taken into account when pricing, and market conditions and the price level of similar models should be consulted to ensure competitive pricing.
(2) For second-hand cars with high mileage or a long period of use, necessary repair and maintenance can be considered, such as replacing worn parts, cleaning or replacing the interior trim, and improving the exterior appearance. Repairing and improving the condition of the car raises its overall quality and increases buyers' interest and willingness to purchase.
(3) Pay attention to the influence of tax. According to the model coefficients, taxes and fees also have an impact on the price of used cars. Sellers should understand the tax information in advance and convey it accurately, so that changes in tax do not create price uncertainty and lead consumers to doubt the price of the car.

Suggestions for Audi car manufacturers
(1) According to the model coefficients, age and mileage have a large impact on the price of used Audi cars, indicating that consumers are very concerned about the wear and tear of second-hand car engines. Manufacturers should focus on improving engine technology to reduce engine wear.
(2) Manufacturers need to provide comprehensive after-sales service, such as repair and maintenance, to increase the value retention rate of used Audi cars.

Innovation points of the model
(1) The traditional linear regression model is prone to overfitting in the presence of collinearity (high correlation between independent variables), while the ridge regression model can deal with the collinearity problem effectively and improve the generalization ability of the model by introducing a regularization term.
(2) The prediction of second-hand car prices is affected by multiple characteristics, including model, mileage, age, car condition, etc. When using ridge regression models, these

Figure 1: Number of models

Through data analysis, the four most popular models are A1, Q3, A4 and A3, and the proportion of their sales is calculated. The results are shown in Figure 2.

Figure 2: The four most popular models

Figure 3: Maximum supply of used Audi cars with different transmissions in the market each year

Figure 4: The number of used Audi cars with different gearboxes

Figure 5: Price distribution of different used Audi models

Table 1: Categorical variables

Table 4: Regression coefficient results