Population prediction in China based on maximum information coefficient and NAR-BP neural network

Abstract: This paper studies the national population census. First, the paper collects census-related data, establishes a maximum information coefficient (MIC) model, and preprocesses the data. It then establishes a dynamic neural network prediction model based on NAR-BP to predict the total population of China in 2030. Furthermore, a PCA-based NAR-BP dynamic neural network prediction model is established to predict the proportions of males and of the urban population in China in 2030. Finally, a neural network optimization model based on GA-BP is established to obtain the optimal search term. Analysis of the experimental results indicates that a 10-year cycle is an appropriate frequency for the Chinese population census.


Introduction
Both in China and internationally, changes in population size are regarded as important indicators of the development, prosperity, and social civilization level of a region [1]. The national census is organized by the state to conduct, in accordance with the law, a comprehensive household-by-household and person-by-person survey and registration of the population present in the country. The focus of the census is to grasp changes in the existing population, gender ratios, and urban-rural population data in various regions, so that the state can formulate its next development policies. At present, the interval between national population censuses in China is about 10 years. From 1949 to 2021, China conducted seven national population censuses. Providing effective predictions of census data through mathematical modeling would therefore be very meaningful.
This paper addresses three problems: predicting the total population of China in 2030; predicting the proportion of males and the proportion of urban population in China in 2030; and giving a reasonable explanation for the frequency of the current census.
First, we use the maximum information coefficient to screen the many candidate factors, and then combine it with a NAR-BP neural network [2] to establish a prediction model. Because the influencing factors affect one another, we use correlation statistics that require relatively little data to explore the latent information and internal connections in the data, and to obtain the correlation between population and each factor. We then use principal component analysis to reduce the dimensionality of the data, and perform linear regression on the reduced data to identify and rank the factors that directly affect the male population and the urban population. Finally, the NAR-BP dynamic neural network model is used to predict the proportions of males and of the urban population in 2030 [3][4].

Assumption
We ignore international migration: neither emigration from China nor immigration into China is considered.

NAR neural network
A NAR (nonlinear autoregressive) neural network is a dynamic neural network model for time series [5], in which the output at a given time is computed from the values of the series at earlier times. This article selects 6 influencing factors through the MIC algorithm and uses the NAR neural network to predict each of them.
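The autoregressive structure described above can be written as y(t) = f(y(t-1), ..., y(t-d)), where d is the lag order and f is the trained network. The following is a minimal sketch of that structure only, not the authors' implementation: the helper names `make_lagged` and `closed_loop_forecast` are hypothetical, and `toy_f` is a stand-in for a trained network.

```python
import numpy as np

def make_lagged(series, d):
    """Build training pairs with X[i] = [y(t-d), ..., y(t-1)] and y[i] = y(t)."""
    series = np.asarray(series, dtype=float)
    if len(series) <= d:
        raise ValueError("series must be longer than the lag order d")
    X = np.array([series[t - d:t] for t in range(d, len(series))])
    y = series[d:]
    return X, y

def closed_loop_forecast(f, history, d, steps):
    """Forecast `steps` values, feeding each prediction back as an input."""
    window = list(np.asarray(history, dtype=float)[-d:])
    forecasts = []
    for _ in range(steps):
        nxt = float(f(np.array(window)))
        forecasts.append(nxt)
        window = window[1:] + [nxt]  # slide the lag window forward by one step
    return np.array(forecasts)

def toy_f(window):
    """Hypothetical stand-in for a trained network: repeat the last increment."""
    return window[-1] + (window[-1] - window[-2])

series = np.arange(10.0)  # synthetic series 0, 1, ..., 9
X, y = make_lagged(series, d=3)
forecast = closed_loop_forecast(toy_f, series, d=3, steps=4)
```

In practice `toy_f` would be replaced by any regressor fitted on the pairs returned by `make_lagged`; the closed-loop function is what allows forecasts to extend beyond the last observed year.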

Principal Component Analysis
The basic principle of principal component analysis (PCA) [6] is to concentrate information scattered across a set of variables onto a few comprehensive indicators, namely the principal components. Each principal component is a linear combination of the original variables, and the principal components are mutually orthogonal. PCA can therefore reduce the dimensionality of a multivariate time series, remove redundant information, reduce some of the noise it contains, and reflect the correlation between different variables. When the sample data have many dimensions and a complex structure, PCA can simplify the input samples, reduce training time, improve training efficiency, and improve the generalization ability of the neural network, as shown in Table 1.
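As an illustration of the principle above, the sketch below computes principal components by singular value decomposition of the standardized data. It is a minimal example under stated assumptions: the function name `pca` is hypothetical, and the demonstration data are synthetic (43 rows, matching the number of years from 1978 to 2020, but not the paper's data).

```python
import numpy as np

def pca(X, n_components):
    """Return scores, component directions, and explained-variance ratios."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    std = centered.std(axis=0, ddof=1)
    std[std == 0] = 1.0                      # avoid division by zero for constant columns
    Z = centered / std                       # standardize each variable
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]           # rows are orthonormal directions
    scores = Z @ components.T                # data projected onto the components
    explained = (S ** 2) / np.sum(S ** 2)    # share of total variance per component
    return scores, components, explained[:n_components]

# Synthetic data: 6 observed variables driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(43, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(43, 6))
scores, components, ratio = pca(X_demo, n_components=2)
```

Because the synthetic variables are driven by two latent factors, the first two components capture almost all of the variance, which is the situation in which dimensionality reduction helps the downstream network most.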

NAR-BP dynamic neural network prediction based on maximum information coefficient
This article collects data from 1978 to 2020 on the total population, per capita GDP, proportion of urban population, proportion of males, employment rate, birth rate, mortality rate, fertility rate, number of medical institutions, medical expenses, education expenses, number of marriages, and the age distribution (proportions of people aged 0-14, 14-65, and 65 and above), as shown in Table 2 and Table 3.

Data preprocessing
A BP neural network is a feed-forward network trained by back propagation of errors. It is mainly composed of three parts: an input layer, a hidden (middle) layer, and an output layer. The numbers of nodes in the input and output layers are relatively easy to determine, whereas choosing the number of nodes in the hidden layer is an important and difficult problem.
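To make the three-layer structure concrete, the following is a minimal sketch of a BP network with one hidden layer. It is an illustration under stated assumptions, not the model used in this paper: it trains by plain gradient descent rather than the Levenberg-Marquardt algorithm, and the function name `train_bp`, the learning rate, and the synthetic target are all hypothetical.

```python
import numpy as np

def train_bp(X, y, n_hidden=10, lr=0.05, epochs=500, seed=0):
    """Train a one-hidden-layer network by gradient descent on the MSE."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    target = y.reshape(-1, 1)
    losses = []
    for _ in range(epochs):
        # Forward pass: input layer -> tanh hidden layer -> linear output layer
        hidden = np.tanh(X @ W1 + b1)
        output = hidden @ W2 + b2
        error = output - target
        losses.append(float(np.mean(error ** 2)))
        # Backward pass: propagate the error gradient from output to input
        grad_out = 2.0 * error / len(X)
        grad_hidden = (grad_out @ W2.T) * (1.0 - hidden ** 2)
        W2 -= lr * (hidden.T @ grad_out)
        b2 -= lr * grad_out.sum(axis=0)
        W1 -= lr * (X.T @ grad_hidden)
        b1 -= lr * grad_hidden.sum(axis=0)

    def predict(X_new):
        return (np.tanh(X_new @ W1 + b1) @ W2 + b2).ravel()

    return predict, losses

# Synthetic regression task: fit y = x^2 on [-1, 1]
X_demo = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)
y_demo = X_demo.ravel() ** 2
predict, losses = train_bp(X_demo, y_demo)
```

The `n_hidden` argument is the hidden-layer size discussed above; in a sketch like this it is typically chosen by comparing validation error across a few candidate values.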
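The MIC of two variables is computed in three steps.
(1) For each grid resolution, partition the scatter plot of the two variables into a grid with x columns and y rows, and find the maximum mutual information over the partitions at that resolution.
(2) Normalize the maximum mutual information scores found above and compile them into a characteristic matrix M, whose rows and columns are indexed by the grid sizes and whose entries are normalized scores between 0 and 1.
(3) Using x, y, and the normalized score as point coordinates in three-dimensional space, the maximum mutual information scores form a surface, and the highest point of this surface is the final MIC value.
MIC does not rely on distributional assumptions about the measured data and can identify a wider range of associations than the measures used in previous studies. Given a bivariate dataset $D \subset \mathbb{R}^2$ containing $n$ samples, the MIC of the two variables is defined as follows.

$$\mathrm{MIC}(D) = \max_{xy < B(n)} \frac{I^{*}(D, x, y)}{\log_2 \min(x, y)}$$

where $x$ and $y$ are the numbers of grid cells along the x-axis and y-axis, respectively, $I^{*}(D, x, y)$ is the maximum mutual information over grids of that resolution, and $B(n)$ is the upper bound on the grid size (commonly $n^{0.6}$). This paper uses the MIC algorithm for data preprocessing and obtains, in turn, the correlation strength between the total population and each of the 13 influencing factors, such as per capita GDP, mortality rate, and birth rate. The results are shown in Table 4.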
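The grid search behind MIC can be sketched as follows. This is a simplified approximation, not the full MIC algorithm: it only tries equal-frequency grids instead of optimizing the partition at each resolution, so it gives a lower bound on the true value. The function names and the exponent `alpha=0.6` for the grid-size bound are assumptions made for illustration.

```python
import numpy as np

def equal_frequency_bins(v, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    ranks = np.argsort(np.argsort(v))
    return (ranks * k) // len(v)

def mutual_information(x_bins, y_bins, nx, ny):
    """Mutual information (in bits) of two discretized variables."""
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x_bins, y_bins), 1)    # count samples per grid cell
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

def approx_mic(x, y, alpha=0.6):
    """Maximum normalized mutual information over grids with nx * ny < n**alpha."""
    x, y = np.asarray(x), np.asarray(y)
    bound = len(x) ** alpha
    best = 0.0
    for nx in range(2, int(bound) + 1):
        for ny in range(2, int(bound) + 1):
            if nx * ny >= bound:
                break
            mi = mutual_information(equal_frequency_bins(x, nx),
                                    equal_frequency_bins(y, ny), nx, ny)
            best = max(best, mi / np.log2(min(nx, ny)))
    return best

# Synthetic checks: a perfect linear relation versus independent noise
rng = np.random.default_rng(0)
x_lin = np.linspace(0.0, 1.0, 200)
mic_linear = approx_mic(x_lin, 2.0 * x_lin + 1.0)
mic_noise = approx_mic(rng.normal(size=200), rng.normal(size=200))
```

Ranking the candidate factors then amounts to computing this score between the total population and each factor and sorting in descending order.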

Figure 1: Heat map of the correlation strength of the 13 influencing factors
In Figure 1, each column represents a sample, and each row represents a variable. The color represents the size of the correlation coefficient, that is, the correlation strength between the population and each variable, and so shows how this strength differs among the screened variables. Each variable has a correlation coefficient of 1 with itself, which appears as the darkest blue. The closer the color is to white, the weaker the correlation. A deep blue (positive correlation) or deep red (negative correlation) color indicates a strong correlation.
Six highly correlated influencing factors are selected from the 13 factors. In order, they are: proportion of labor force population, proportion of urban population, birth rate, proportion of aging population, fertility rate, and medical expenditure.

Experiments
The paper takes the 6 main influencing factors as input nodes, sets 10 hidden neurons and a lag order of 6, and trains the NAR neural network with the Levenberg-Marquardt algorithm. The training function is trainlm, the transfer function is tansig, and the adaptive weight learning function is learngd. The data are divided into a training set (70%), a validation set (15%), and a testing set (15%). The NAR network is trained, training is stopped when the sample mean square error increases, and the prediction result is calculated. Subsequently, the trained NAR model is combined with a BP neural network to obtain the predicted values of the influencing factors and the population from 2021 to 2030 through rolling grouping.
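The data division and stopping rule described above can be sketched as follows. This is an illustrative sketch only: the function names, the random assignment of samples, and the `patience` parameter are assumptions, and the sample count of 37 simply corresponds to 43 yearly observations minus a lag order of 6.

```python
import numpy as np

def divide_indices(n, ratios=(0.70, 0.15, 0.15), seed=0):
    """Randomly split n sample indices into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(ratios[0] * n))
    n_val = int(round(ratios[1] * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def early_stopping_epoch(val_errors, patience=1):
    """Return the epoch with the lowest validation error seen before stopping.

    Training stops once the error has failed to improve `patience` times in a row.
    """
    best, best_epoch, failures = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch, failures = err, epoch, 0
        else:
            failures += 1
            if failures >= patience:
                break
    return best_epoch

train_idx, val_idx, test_idx = divide_indices(37)
# Hypothetical validation-error curve: improves for three epochs, then worsens
stop_at = early_stopping_epoch([5.0, 4.0, 3.0, 3.5, 3.6, 3.7], patience=2)
```

The weights kept after training would be those from the epoch returned by `early_stopping_epoch`, so that the later rise in validation error does not affect the forecasts.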
The predicted population shows a continuous growth trend, and by 2030 the total population of the country is predicted to reach 1488.9633 million.
To verify the effectiveness of the PCA-NAR-BP model, the sample data from 1978 to 2010 were used as the training set to predict the proportions of males and of the urban population in China from 2011 to 2020. The average errors were 0.0501126 and 0.21304, respectively, indicating that the combined model is highly feasible. Table 5 shows the results.
The model then predicts the data from 2021 to 2030 using the rolling grouping method: the predicted factors of the preceding 10 years are used to predict the male and urban population proportions of the following year, up to 2030. In 2030, the male and urban population proportions are predicted to be 51.38% and 66.33%, respectively. The proportion of males remains relatively stable over the next decade, although it shows a downward trend from the first census to 2030. The proportion of urban population shows a gradual growth trend over the next decade. Because early statistical data are lacking, its trend can only be traced from the third census onward, since when the urban population has been increasing year by year up to 2030.

Suggestions for the census
If the census were conducted once every 5 years, then over a period such as 1975 to 1990 the fertility rate, the proportion of the working population, and the proportion of the aging population remained at normal levels, so the high-frequency surveys would have consumed a large amount of manpower, material, and financial resources without identifying any problems.
If the census were conducted once every 15 or 20 years, then from 1985 to 2005 the fertility rate, the proportion of the working population, and the proportion of the aging population all moved outside their normal ranges: the fertility rate fell suddenly to 16%, the proportion of the working population rose sharply to 70%, and the proportion of the aging population rose to 7.5%. This led to deepening aging, heavy family burdens, and a shrinking young workforce. With so long an interval, policies could not be adjusted in time, which would seriously affect the country's fertility level, population balance, and economic development.
The most suitable frequency for conducting a census is therefore once every 10 years, which allows the country to take measures that are both proactive and targeted. It provides strong statistical support for developing population-related strategies and policies that promote long-term balanced population development, while saving considerable manpower and material resources.

Sensitivity analysis
Taking into account the impact of various factors on population change, a model suitable for predicting population change was established. The data were analyzed using the maximum information coefficient (MIC) algorithm and the NAR-BP neural network to identify the main indicators that affect population change. This makes it possible to predict abnormal population changes more accurately at an early stage and to adjust policies according to the abnormal changes in the corresponding indicators.
The method described in this paper can build different models from different data, and can also integrate multiple data sources into one model. The model adopts several data analysis methods, namely the maximum information coefficient (MIC) algorithm, principal component analysis (PCA), and the NAR-BP dynamic neural network model, which are well suited to the problem and can be compared with other methods. This allows us to choose, based on the model outputs, the model most closely suited to population prediction, thereby improving prediction accuracy.

Conclusion
This paper provides a reference for current trends in population distribution, size, and structure, and offers a better grasp of future developments. The approach not only analyzes predictions well, but can also be applied effectively to similar evaluation and prediction problems. However, many dynamic factors affect population growth, and they cannot all be accounted for, so there is still some distance between the model and reality. Different models have high predictive power within their corresponding time periods, but their predictive power declines outside those periods. Given today's growing demand for scientific and quantitative decision-making, this work is in line with current trends and development needs.

Table 4: The correlation index of the 13 influencing factors