A Study on the Transaction Price of Second-hand Sailboats Based on the Random Forest Regression Model

Abstract: The second-hand sailboat market is booming, but prices are uncertain, which makes it difficult for sellers to determine the optimal selling price. To address this issue, this study employs three regression models, namely the Random Forest Regression Model, the Decision Tree Regression Model, and the Support Vector Machine Regression Model, to explore the main factors affecting the pricing of second-hand sailboats and to predict their prices. The results show that the length of a second-hand sailboat has the greatest impact on its price, and that the Random Forest Regression Model has the highest accuracy in predicting transaction prices. This prediction method can help sellers price their boats better and promote the development of the second-hand sailboat market.


Introduction
In recent years, sailing has become an increasingly popular recreational activity, with thousands of individuals and families taking to the water each weekend. As a result, demand for quality second-hand sailboats has increased, as budget-conscious buyers seek out affordable options [1].
However, the pricing of these sailboats is highly complex and uncertain, with numerous factors affecting their value. Researchers have approached this problem in several ways. Eleftherios Ioannis Thalassinos and Evangelos Politis provide an evaluation model for the pricing of used bulk carriers; their valuation process relies on cash flow analysis and methodology [2]. Andreas estimated the shadow price of the most relevant determinants in a hedonic price model for second-hand ships by analyzing second-hand ship sale and purchase fixtures [3]. Despite these efforts, a comprehensive understanding of pricing trends and variation in the used sailboat market has remained elusive.
To address this problem, this paper develops a prediction model for second-hand sailboat prices through regression analysis, with the aim of improving the accuracy of price prediction. Three regression models, namely the Random Forest Regression Model [4], the Decision Tree Regression Model [5], and the Support Vector Machine Regression Model [6], are employed to predict the prices of second-hand sailboats, and the predictions are compared with the actual transaction prices. The three models are used to identify the main factors affecting the pricing of second-hand sailboats and to determine which model predicts transaction prices best. It is expected that this paper will help sellers price their boats better and promote the development of the second-hand sailboat market.

Sailboat Price Model Construction
This paper extracts transaction data for second-hand sailboats in 2019 from the authoritative boat website boats and categorizes the data by Make and Geographic Region. Each transaction record contains information about the sailboat, such as Make, Variant, Length (ft), Geographic Region, Country/Region/State, Listing Price (USD), Year, Make Variant, LWL (ft), Beam (ft), Draft (ft), Displacement (lbs), and Sail Area (sq ft).
Considering the uncertainty of second-hand sailboat prices, this paper adopts three methods to predict the price of sailboats, hoping to improve the accuracy of the prediction results as much as possible and provide better pricing suggestions for sellers.

Data Preprocessing
The raw data contain a large number of incomplete and abnormal values, which could significantly affect the efficiency of modeling and the accuracy of the conclusions. Therefore, it is crucial to preprocess the data.
Missing and abnormal values are handled as follows. First, missing data are identified. Missing data are then processed in one of two ways: (1) variables with a large amount of missing data are deleted directly; (2) for variables with a small amount of missing data, interpolation is used to fill in the gaps. After screening the 3500 sets of data, 3017 sets of valid data remain.
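The two rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used in this study: the 30% cut-off for "a large amount of missing data", the choice of linear interpolation, and the names `interpolate` and `preprocess` are all hypothetical.

```python
from typing import Optional

Column = list[Optional[float]]


def interpolate(values: Column) -> list[float]:
    """Fill None entries by linear interpolation between known neighbours.

    Gaps at either end are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("column has no observed values")
    filled = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(float(v))
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled.append(float(values[right]))
        elif right is None:
            filled.append(float(values[left]))
        else:
            weight = (i - left) / (right - left)
            filled.append(values[left] + weight * (values[right] - values[left]))
    return filled


def preprocess(table: dict[str, Column], max_missing: float = 0.3) -> dict[str, list[float]]:
    """Drop variables that are mostly missing; interpolate the rest."""
    cleaned = {}
    for name, values in table.items():
        missing = sum(v is None for v in values) / len(values)
        if missing > max_missing:
            continue  # rule (1): too sparse, delete the variable (assumed cut-off)
        cleaned[name] = interpolate(values)  # rule (2): fill small gaps
    return cleaned


# Toy data, invented for illustration only.
table = {
    "Length": [38.0, None, 42.0, 45.0],
    "Sail Area": [None, None, None, 900.0],
}
cleaned = preprocess(table)
```

In this toy table, Sail Area is dropped because three of its four values are missing, while the single gap in Length is filled from its neighbours.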

Variance Test
The variance test selects features by calculating the variance of each feature, usually by setting a variance threshold and considering for deletion the features that do not reach it [7].
From the calculation results, the variable with the smallest variance is Draft, with a variance of 1.9844. This variance is not close to 0, which means that the variable still contributes to differentiating the samples.
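A variance filter of this kind can be sketched as follows. The threshold value, the use of the population variance, and the toy numbers are assumptions made for illustration; the threshold used in this study is not stated.

```python
from statistics import pvariance


def low_variance_features(columns: dict[str, list[float]], threshold: float):
    """Return each feature's variance and the features falling below the threshold."""
    variances = {name: pvariance(values) for name, values in columns.items()}
    dropped = [name for name, var in variances.items() if var < threshold]
    return variances, dropped


# Toy data: "Constant" is a hypothetical feature that never changes.
columns = {
    "Draft": [4.0, 6.0, 5.0, 7.0],
    "Constant": [1.0, 1.0, 1.0, 1.0],
}
variances, dropped = low_variance_features(columns, threshold=0.01)
```

A feature whose variance is far from zero, as Draft is here, passes the filter; only features that are nearly constant across the data set are flagged.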

Spearman Test
The Spearman test is used to calculate the correlation between the independent variables through the correlation coefficient ($\rho$) [8]. The Spearman coefficient is given in formula (1):

$$\rho = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \, \sum_{i=1}^{n} (S_i - \bar{S})^2}} \tag{1}$$

where $R_i$ and $S_i$ are the ranks of the $i$-th observation of the two variables, $\bar{R}$ and $\bar{S}$ are the mean ranks, and $\rho$ is the Spearman correlation coefficient, which measures the strength of the correlation between R and S. The closer $|\rho|$ is to 1, the stronger the correlation between R and S.
The correlations between the independent variables computed in this way are shown as a heat map in Figure 1.

Figure 1: Heat map of correlation coefficient of independent variables
From Figure 1, it can be seen that the correlation coefficient between Length and LWL is 0.9, which indicates a strong correlation between the two. Indeed, a longer sailboat has a longer waterline, so LWL largely duplicates the information carried by Length. Hence, to avoid the adverse effects of multicollinearity on the model, the variable LWL is eliminated. The remaining features are shown in Table 2.
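The rank correlation behind this screening step can be sketched as below. The code computes the Spearman coefficient as the Pearson correlation of average ranks, which is the standard definition; the function names and the toy measurements are hypothetical and are not taken from the data set.

```python
from math import sqrt


def ranks(values: list[float]) -> list[float]:
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman coefficient: Pearson correlation applied to the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy measurements in feet, invented for illustration.
length = [30.0, 35.0, 40.0, 45.0, 50.0]
lwl = [26.0, 31.0, 33.0, 41.0, 44.0]
beam = [14.0, 12.0, 16.0, 13.0, 15.0]

rho_length_lwl = spearman(length, lwl)    # perfectly monotone pair
rho_length_beam = spearman(length, beam)  # weakly related pair
```

A pair of variables whose coefficient exceeds a chosen cut-off is a candidate for dropping one of the two, which is the reasoning applied to LWL above.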

Sailboats Price of Random Forest Regression Model (SPRF Model)
The random forest algorithm combines the Bagging method with the Random Subspace method, and its basic building block is the decision tree (either binary or multi-way) [4].
The inputs are characterized by n, the number of samples, and M, the number of features.
The random forest algorithm involves two "random" processes. The first is the random generation of a training set, which is then used to train the model. The second is the random selection of a subset of features, from which the best split feature is chosen [9].
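The two random processes can be illustrated with a short sketch. It only draws the random inputs that each tree would receive and does not grow any trees; the helper names, the seed, and the subset size of 2 are assumptions. In common random forest implementations the feature subset is redrawn at every split, whereas this sketch draws it once per tree for brevity.

```python
import random


def bootstrap_sample(n_samples: int, rng: random.Random) -> list[int]:
    """First random process: draw n row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(feature_names: list[str], k: int, rng: random.Random) -> list[str]:
    """Second random process: draw k distinct candidate features."""
    return rng.sample(feature_names, k)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ["Length", "Year", "Beam", "Draft", "Displacement", "Sail Area"]

# One (rows, features) plan per tree; a real forest would fit a tree on each.
plans = [
    {
        "rows": bootstrap_sample(10, rng),
        "features": random_feature_subset(features, 2, rng),
    }
    for _ in range(3)
]
```

Because every tree sees a different resampled training set and a different set of candidate features, the trees are decorrelated, and averaging their predictions reduces variance.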

Sailboat Price of Decision Tree Regression Model (SPDT Model)
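Decision Tree is a decision analysis method based on the known probability of occurrence of various situations, used to evaluate project risk and judge its feasibility [10]. The information gain $g(D, A)$ of feature $A$ on the training data set $D$ is calculated as:

$$g(D, A) = H(D) - H(D \mid A)$$

where the empirical entropy $H(D)$ of data set $D$ is:

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

and the empirical conditional entropy $H(D \mid A)$ of feature $A$ on data set $D$ is:

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$$

Here $C_k$ is the subset of samples in $D$ belonging to class $k$, and $D_i$ is the subset of $D$ on which feature $A$ takes its $i$-th value. In this problem, the decision tree model is used to analyze the prices of second-hand sailboats; it is applied to sequential decision-making, taking the maximum expected benefit or the minimum expected cost as the decision criterion.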
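The information gain calculation can be sketched as follows. Since entropy is defined over discrete classes, the sketch assumes that prices have been grouped into bands ("high" and "low"); that discretization, the feature names, and the toy records are illustrative assumptions and not part of this study's data.

```python
from collections import Counter
from math import log2


def entropy(labels: list[str]) -> float:
    """Empirical entropy H(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(feature_values: list[str], labels: list[str]) -> float:
    """Empirical conditional entropy H(D|A): weighted entropy of each feature group."""
    n = len(labels)
    groups: dict[str, list[str]] = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(feature_values: list[str], labels: list[str]) -> float:
    """g(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)


# Toy records, invented for illustration.
price_band = ["high", "high", "low", "low"]
hull = ["catamaran", "catamaran", "monohull", "monohull"]
region = ["Europe", "USA", "Europe", "USA"]

gain_hull = information_gain(hull, price_band)      # splits the bands perfectly
gain_region = information_gain(region, price_band)  # carries no information
```

The feature with the largest gain is the one chosen for the split; here the hypothetical hull feature would be preferred over region.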

Sailboat Price of Supporting Vector Machine Model (SPSVM Model)
Given data to be fitted by a function from the class $\Phi$:

$$\Phi = \operatorname{span}\{\phi_1(x), \phi_2(x), \cdots, \phi_n(x)\} = \{\phi(x) = a_1 \phi_1(x) + a_2 \phi_2(x) + \cdots + a_n \phi_n(x),\ a_i \in \mathbb{R},\ i = 1, 2, \cdots, n\}$$

where $\phi_1(x), \phi_2(x), \cdots, \phi_n(x)$ are linearly independent. Then the deviation of $\phi(x)$ at $x_i$ is:

$$\delta_i = \phi(x_i) - y_i$$

where $y_i$ is the observed value at $x_i$. To solve the resulting optimization problem, this paper chooses the support vector machine (SVM) regression model. A Support Vector Machine is a generalized linear classifier that performs binary classification of data by supervised learning. Its decision boundary is the maximum-margin hyperplane fitted to the training samples. It is generally used for classification tasks, and support vector regression (SVR) is the variant of SVM used for regression analysis [8].
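To make the role of the deviations concrete, the sketch below takes the deviation at each point as the fitted value minus the observed value and scores a fit with the epsilon-insensitive loss. That loss is the one commonly associated with support vector regression, but it is not spelled out in this paper, so its use here is an assumption, as are the function names, the candidate function, and the toy data.

```python
from typing import Callable


def deviations(phi: Callable[[float], float], xs: list[float], ys: list[float]) -> list[float]:
    """Deviation of the fitted function at each sample: phi(x_i) - y_i."""
    return [phi(x) - y for x, y in zip(xs, ys)]


def epsilon_insensitive_loss(devs: list[float], epsilon: float) -> float:
    """Sum of deviations beyond the tolerance; deviations within epsilon cost nothing."""
    return sum(max(0.0, abs(d) - epsilon) for d in devs)


def phi(x: float) -> float:
    """Hypothetical candidate from span{1, x}, with a_1 = 1 and a_2 = 2."""
    return 1.0 + 2.0 * x


xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.5, 4.0]

devs = deviations(phi, xs, ys)                      # [0.0, -0.5, 1.0]
loss = epsilon_insensitive_loss(devs, epsilon=0.5)  # only the last point is penalised
```

Under this loss, small deviations are ignored entirely, so the fit is determined by the samples that lie on or outside the tolerance band.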

Results
The results show that the three models have made accurate predictions on the second-hand sailboat prices.
To evaluate the prediction performance of the models, two evaluation indicators are introduced.
(1) Mean absolute error (MAE): the average of the absolute differences between the predicted values and the actual values:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

The smaller the MAE, the better the model prediction; conversely, the larger the MAE, the worse the model prediction.
(2) R-squared (coefficient of determination):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $y_i$ is the actual value, $\bar{y}$ is the mean of the actual values, and $\hat{y}_i$ is the fitted value. $R^2 \in [0, 1]$. The closer $R^2$ is to 1, the better the model prediction.
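Both indicators are straightforward to compute. The following sketch uses their standard definitions (mean absolute difference between actual and predicted values, and one minus the ratio of residual to total sum of squares); the function names and the toy numbers are illustrative only.

```python
def mean_absolute_error(actual: list[float], predicted: list[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual: list[float], predicted: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values, invented for illustration.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
```

Note that MAE is expressed in the units of the target variable, so its magnitude depends on how the price has been scaled, whereas R-squared is unitless.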

Result of SPDT Model
The final result of the SPDT Model is shown in Table 3. Figure 2 presents the validation set of the Decision Tree Regression Model, which is applied to predict the listing price of used sailboats. The model is trained and tuned over a series of iterations to minimize the mean absolute error (MAE) of the predicted prices. The low MAE value of 0.1927 indicates that the model predicts the listing price of sailboats with reasonable accuracy, i.e., the difference between the actual and predicted prices falls within an acceptable range. The low error also suggests that the model is not overfitted and can generalize to unseen data. The model's good performance on the validation set further supports its ability to predict the listing price of sailboats.

Result of the SPSVM Model
The R-squared and MAE of the SPSVM Model are calculated. The results are shown in Table 4. Figure 3 shows the validation set of the Support Vector Machine Regression Model for the second-hand sailboat market. Its mean absolute error (MAE) of 0.1731 is lower than that of the Decision Tree Regression Model, showing that the difference between the actual and predicted values is also within a reasonable range. After a series of iterations, the comparison of actual and predicted listing prices shows that the model can be used to predict the price of second-hand sailboats. The reasonable fit suggests that the Support Vector Machine Regression Model can predict unseen data well.

Result of SPRF Model
The R-squared and MAE of the SPRF Model are also calculated. The results are shown in Table 5. Figure 4 shows the validation set of the Random Forest Regression Model used to predict the listing price of second-hand sailboats. The mean absolute error (MAE) is 0.1529, the lowest among the three models. In other words, there is little difference between the predicted values (blue points) and the true values (red points) in Figure 4. Compared with the first two models, the lower MAE of the SPRF Model indicates a better fit and a smaller gap between the predicted and the actual listing prices of second-hand sailboats. This helps sellers better predict the listing price of second-hand sailboats.

Comparative analysis of model results
The prediction results obtained from the training sets and test sets of the three models are summarized in Table 6. It can be seen from Table 7 that the characteristics with the greatest impact on the pricing of second-hand sailboats are the length and the year of production.
The three main influencing factors listed in Table 7 are applied to the random forest regression model, and the predicted result is visualized in Figure 5. Finally, the second-hand sailboat prices predicted by the random forest regression model are compared with the data obtained from the boats website. The relative error between each predicted value and the corresponding actual value is then calculated and plotted as a bar chart in Figure 6. Figure 6 shows that most of the relative errors of the forecast are within the range of 0~1.2%, which indicates that the accuracy of the model is very high. In terms of absolute error, the mean absolute error between the price predicted by the random forest regression model and the real price is 0.1529. These two error values indicate that the random forest regression model is highly accurate in predicting the price of second-hand sailboats, thus providing valuable suggestions for sellers in pricing.
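The relative error check described above can be sketched as follows. The function names and the toy values are hypothetical; only the 1.2% bound is taken from the text.

```python
def relative_errors(actual: list[float], predicted: list[float]) -> list[float]:
    """Relative error |predicted - actual| / |actual| for each sample."""
    return [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]


def share_within(errors: list[float], bound: float) -> float:
    """Fraction of samples whose relative error does not exceed the bound."""
    return sum(e <= bound for e in errors) / len(errors)


# Toy values, invented for illustration.
actual = [100.0, 200.0, 250.0]
predicted = [101.0, 198.0, 250.0]

errors = relative_errors(actual, predicted)
within = share_within(errors, bound=0.012)  # share of errors within 1.2%
```

Reporting the share of samples inside the bound, alongside the bar chart, makes the statement that "most" errors are small easy to verify.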

Conclusion
By using the random forest regression model, decision tree regression model, and support vector machine regression model to predict the second-hand sailboat price, and comparing it with the actual transaction price, this paper puts forward an effective method to solve the challenge of the uncertain

Table 3: Result of SPDT Model
The performance of the Decision Tree Regression Model on the training set obtained from feature selection is better, and the test data set is visualized in Figure 2.

Figure 2: The validation set of the decision tree regression model

Figure 3: The validation set of the support vector machine regression model

Figure 4: The validation set of the random forest regression model

Figure 5: The prediction of the listing price

Figure 6: The error of the predicted value

Table 1: Variance of each catamaran variable
A feature can be considered to contribute significantly to differentiating the samples if its value varies widely across the data set, as shown in Table 1.

Table 2: Remaining characteristic variables

Table 4: Result of the SPSVM Model
By machine learning, the Support Vector Machine Regression Model is visualized graphically in Figure 3.

Table 5: Results of the SPRF Model

Table 7: Three important variable features in SPRF model