Research on Ensemble Learning-based Housing Price Prediction Model

Housing prices are influenced by multiple factors. Existing housing price forecasting models are usually so-called single-predictor models, whose prediction accuracy is often unsatisfactory and which tend to over-fit when the data are noisy. To address these issues, this paper proposes an ensemble learning-based housing price prediction model that incorporates multiple predictors. To evaluate the effectiveness of the proposed model, the extra trees, random forest, GBDT, and XGB algorithms are selected as benchmarks. The dataset used is the publicly available California housing price dataset. The results demonstrate that the proposed method improves prediction accuracy and stability compared with the four single prediction models.


Introduction
Real estate is not only a key sector of the national economy but also one of the public's major concerns. Driven by housing demand, people's attention to housing prices continues to grow, so it is critical to provide accurate housing price predictions. Housing prices are affected by multiple factors ([2], [10]), including time and space, house age, surrounding conditions, communities, transportation, etc. Existing prediction models are usually single-predictor ones, i.e., a single forecasting model is applied to the prediction task. The prediction accuracy of such models is not satisfactory when datasets are noisy [4], and even some simple ensemble models such as random forest can over-fit when the data contain substantial noise. To address these issues, this paper proposes an ensemble learning ([1], [11]) based housing price prediction model. The model is built upon multiple single predictors (called base predictors in the following discussion): random forest (RF), extra trees (ET), GBDT, and XGB.
Random forest [7], whose basic unit is the decision tree, is an ensemble algorithm/model employing multiple trees. It has shown its superiority in many application areas: it can handle high-dimensional data without feature selection, and it yields an unbiased estimate of the generalization error internally during forest construction, so its generalization capability is good. Nevertheless, random forest may still over-fit in classification or regression problems where noise is present.
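The internal, unbiased error estimate mentioned above is the out-of-bag (OOB) estimate. A minimal sketch on synthetic regression data (a stand-in for the housing dataset; all sizes and parameters here are illustrative, not the paper's settings):

```python
# Sketch of the out-of-bag (OOB) error estimate described above,
# using synthetic regression data as a stand-in for the housing dataset.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)

# oob_score=True computes an R^2 estimate from the samples left out of
# each bootstrap draw, without needing a separate validation set.
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB R^2 estimate: {rf.oob_score_:.3f}")
```

Because every tree sees only a bootstrap sample, each training point is "out of bag" for roughly a third of the trees, and averaging those trees' predictions gives the unbiased internal estimate.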
Extra trees, also known as extremely randomized trees, is likewise a combination of decision trees. Table 4 lists the loss function values, measured by mean square error (MSE), used to evaluate the performance of each base predictor. From these results, it is easy to see that both GBDT and XGB produce better predictions than ET and RF in terms of MSE. The next section presents details on how the four base predictors are used to construct the ensemble model, and further computational experiments and benchmarks are carried out to evaluate the effectiveness of the resulting ensemble model.

Model ensemble and training
In the previous section, the four base predictors were trained and the corresponding forecasting models and results were obtained. These four base predictors are now employed to create the final ensemble model. The entire training process for the ensemble model is performed in two stages. Assume that the given sample dataset L = {(x_i, y_i), i = 1, 2, ..., n} contains n tuples (samples), where x_i is the feature vector of the i-th sample after dimensional reduction (PCA) and y_i is the i-th target (real) value. Specifically, in this case there are 20,000 samples, each with a fixed number of features, and y_i is the true housing price associated with the i-th sample.
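The feature preparation described above (each x_i is a feature vector after PCA) might look as follows. This is a sketch only: the data here are synthetic with a hypothetical target, and the 95% variance threshold is an assumption, not a value taken from the paper:

```python
# Sketch of the dimensional reduction step: standardize, then apply PCA.
# Data are synthetic (correlated features from 4 latent factors); the
# placeholder target stands in for the true housing price y_i.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
Z_latent = rng.normal(size=(20000, 4))         # hidden factors
W = rng.normal(size=(4, 12))
X = Z_latent @ W + 0.1 * rng.normal(size=(20000, 12))  # 12 raw features
y = Z_latent[:, 0] * 3.0 + rng.normal(size=20000)      # placeholder "price"

# Keep enough principal components to explain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)  # far fewer columns than the 12 raw features
```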
To prevent over-fitting, the principle of cross-validation is applied when constructing the second-level dataset. Since the stacking ensemble learning method is used to predict the house price data, each of the four base predictors first produces one set of predictions; these predictions, together with the corresponding targets from the original dataset, are merged into a second-level dataset on which a well-performing meta-predictor is trained. In this way no single base predictor dominates the decision, and the four base predictors' outputs are combined into the final prediction. Concretely, the original dataset L is randomly divided into k parts (called subsamples below), L_1, L_2, ..., L_k. For i = 1, 2, ..., k, L^(i) = L - L_i is defined as the i-th cross-validation training set and L_i as the corresponding test set.
In each round, the four base predictors are trained separately on the training set, yielding four fitted base predictors. The prediction produced by the j-th predictor for the i-th sample in the test set is denoted Z_ij. Since there are k subsamples, the training process repeats k times, and each sample receives t predictions (t = 4 here, one per base predictor). These predictions, together with the target values of the corresponding samples, form the dataset used for the second stage, namely L_cv = {(Z_i1, Z_i2, ..., Z_it, y_i), i = 1, 2, ..., n}. Through this process, the training dataset of the ensemble stage is a new dataset consisting of all base-predictor outputs and the corresponding target values (housing prices in this case). Finally, the ensemble prediction model is trained on L_cv. Any single base predictor has its pros and cons and cannot be expected to perform universally well on all datasets. By applying ensemble techniques, the advantages of the underlying base predictors are strengthened while their shortcomings are mitigated, and the ensemble model proves effective on noisy datasets and against over-fitting.

Conclusion
This paper presents a housing price prediction model built upon ET, RF, GBDT, and XGB using the stacking ensemble learning methodology. Building the ensemble model involves extracting relevant features from the California housing price data, performing dimensional reduction, and training the base models separately. During ensemble construction, the individual prediction results are used as inputs for training the ensemble predictor, which yields the final prediction model. The advantage of this model is that it improves prediction accuracy and effectively avoids over-fitting when the dataset contains noise or too many features; at the same time, the ensemble model produces more stable results. Although the proposed ensemble model cannot be claimed to outperform every base predictor consistently in all scenarios, the outcomes it obtains are very promising, which encourages applying similar techniques to other machine learning problems in the future.

Table 4. Mean square errors of the four base models

Table 5. MSE for all models

It is easy to see that the ensemble model obtains the prediction results with the lowest MSE, which is reduced by 6.7% on average. The computational results indicate that the ensemble model provides the most accurate predictions in general.