Analysis and Prediction of Water Consumption in Nanjing: A Comparative Study Based on Stepwise Regression and Nonparametric Model

Abstract: Accurate prediction of urban water consumption is of great significance to the management and improvement of water supply systems. In this thesis, correlation analysis is performed on 16 factors affecting total water consumption in Nanjing from 2005 to 2019, the factors with a more obvious influence are selected, and they are then fitted with a stepwise linear regression model and a nonparametric regression model respectively. Comparing the fitting results shows that the nonparametric regression model predicts better than the stepwise regression model. The analysis results show that the prediction of urban water consumption is very important for the rational allocation of water resources.


Introduction
Water is one of the most important material resources for human survival and development. Abundant available water resources help accelerate urban development and strengthen urban capacity, ensure the sustainable development of society, economy and ecology, and promote the construction of a harmonious society. Total urban water consumption reflects the long-term water demand of a city, so predicting it accurately not only supports the rational allocation of water resources but also provides a reference for ensuring sufficient domestic, industrial and agricultural water supply, which are vital parts of water resources allocation in urban development. The government has always attached great importance to more accurate prediction of urban water allocation in urban development and planning [1].
The prediction methods for urban water consumption based on historical observation data mainly include time series analysis, structural analysis and system analysis [2]. In recent years, rapidly developing combined analysis has also become a main research topic for domestic scholars. Time series analysis is based mainly on the periodic variation of urban water consumption: it carries out statistical analysis with a fixed time span as the period of change and proposes mathematical models and calculation methods for fitting and prediction, mainly the moving average method [3] and the exponential smoothing method [4]. Since this analysis does not capture the trend of random disturbances in water consumption, places stricter requirements on the related influencing variables and depends heavily on historical data, it is more suitable for medium- and long-term prediction.
Structural analysis focuses on the relationship between urban water consumption and related factors (such as gross regional product, water-consuming population and temperature), identifies the main influencing factors and the problems in rational urban water consumption planning, and puts forward policy suggestions based on the analysis and prediction results. It mainly comprises regression analysis [1,5,6,7] and stepwise regression [8]. However, the factors affecting water consumption are complex, and it is difficult to predict the future values of each influencing factor accurately, so the prediction results will deviate to some extent.
System analysis does not investigate the effect of individual factors. Instead, it reflects the combined effect of all factors, weakens the influence of random factors, and thus reveals the inherent regularity of changes in water consumption. It mainly includes the artificial neural network method [9,10], the grey prediction method [11], system dynamics [12] and other methods, and can be used for medium- and long-term as well as short-term prediction. Its disadvantages are low accuracy when predicting time series with unstable changes or abrupt structural changes, and a heavy computational load. In addition, the performance of the algorithm depends on the initial conditions, the process easily falls into local minima, and convergence is slow.
In recent years, scholars at home and abroad have continued to explore prediction models better adapted to current urban development. Combined prediction models [13][14][15] have gradually developed as a result. They have been widely studied and are often applied to large data sets in combination with neural networks.
A review of the domestic literature on urban water consumption prediction shows that there are many research results, most of them empirical. However, scholars differ in their choice of research objects, indicators and methods, and each model has its own advantages and disadvantages. Most studies focus on short-term effects. With regard to the long-term sustainable development and rational allocation of urban water consumption, few studies consider the impact of economic benefits, and little of the literature uses nonparametric models for prediction; what exists mostly concerns short-term prediction [16,17]. This thesis takes the actual water consumption of Nanjing from 2005 to 2019 as the dependent variable, compares the nonparametric model with the stepwise linear model, uses both models for prediction, and shows the advantages of the nonparametric model in data fitting and prediction under long-term stable economic development.

Research methods
In this thesis, 16 influencing variables are selected from the long-term influencing factors of actual water consumption in Nanjing, and the statistical data of Nanjing, Jiangsu Province, from 2005 to 2019 are taken as the sample. Correlation analysis is used to eliminate the variables with smaller influence, nonparametric regression is then carried out, and its results are compared with those of the stepwise regression model to analyze how well the two models predict annual urban water consumption.

Data Source
In view of data availability, the data in this thesis come from the Nanjing Statistical Yearbook, 2006 to 2020 editions, and total water consumption TWC (10,000 cubic meters) is selected as the dependent variable. The preset influencing variables are industrial water consumption IWC (10,000 cubic meters), domestic water consumption DC (10,000 cubic meters), water consumption population WCP (10,000 people), production water reuse PWR (10,000 cubic meters), annual sewage treatment consumption ASTC (10,000 tons), gross regional product GRP (100 million yuan), per capita GDP GDPpc (yuan), average temperature AT (℃), average maximum temperature AMAXT (℃), average minimum temperature AMINT (℃), extreme maximum temperature EMAXT (℃), extreme minimum temperature EMINT (℃), total precipitation TP (mm), cultivated land area at the end of the year CY (thousand hectares), green coverage rate of built-up area GCR (%) and year-round sunshine YRC (hours).

Multivariate Stepwise Linear Regression
The basic idea of stepwise regression is to introduce variables into the model one by one. After each explanatory variable is introduced, an F test is carried out, and the explanatory variables already selected are tested one by one with t tests. When a previously selected explanatory variable is no longer significant because of a later introduction, it is deleted, so that only significant variables are in the regression equation before each new variable is introduced. The process iterates until no significant explanatory variable remains to be added to the regression equation and no insignificant explanatory variable remains to be removed, which ensures that the final set of explanatory variables is the best and simplest.
Based on the selected dependent variable and the data for the 16 independent variables, the stepwise regression results are as follows. Table 1 gives the result of each step of model fitting: the R, R² and adjusted R² of model 4 are higher than those of the other models, indicating that model 4 has the highest goodness of fit. The stepwise regression model therefore selects GRP (100 million yuan), IWC (10,000 cubic meters), ASTC (10,000 tons) and DC (10,000 cubic meters) as the main factors influencing urban TWC. The resulting model, Equation (1), has good significance.
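The add-and-remove loop described above can be illustrated with a minimal Python sketch. It is not the procedure used to produce the tables in this thesis: the function names, the partial F thresholds `f_enter` and `f_remove`, and the plain normal-equation solver are all assumptions made for illustration. For a single variable the partial F statistic equals the square of its t statistic, so one statistic serves for both entry and removal.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]  # fails on exactly collinear columns
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on columns `cols`."""
    rows = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(cols) + 1
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bk * rk for bk, rk in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def partial_f(X, y, base, j):
    """Partial F statistic for adding column j to the model with columns `base`."""
    full = base + [j]
    rss_full = rss(X, y, full)
    dof = len(y) - len(full) - 1  # residual degrees of freedom of the larger model
    return (rss(X, y, base) - rss_full) / (rss_full / dof)

def stepwise(X, y, f_enter=4.0, f_remove=3.9, max_iter=100):
    """Return the selected column indices (thresholds are hypothetical defaults)."""
    selected = []
    for _ in range(max_iter):
        changed = False
        # Forward step: add the candidate with the largest partial F, if significant.
        candidates = [j for j in range(len(X[0])) if j not in selected]
        if candidates and len(y) - len(selected) - 2 > 0:
            f_vals = {j: partial_f(X, y, selected, j) for j in candidates}
            best = max(f_vals, key=f_vals.get)
            if f_vals[best] > f_enter:
                selected.append(best)
                changed = True
        # Backward step: drop the weakest selected variable if it lost significance.
        if selected:
            f_vals = {j: partial_f(X, y, [c for c in selected if c != j], j)
                      for j in selected}
            worst = min(f_vals, key=f_vals.get)
            if f_vals[worst] < f_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return selected
```

Keeping `f_enter` above `f_remove` prevents a variable from being added and removed in the same pass. Statistical packages usually express these thresholds as p-values instead of raw F values.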

Correlation Test
In order to analyze more comprehensively how strongly the important factors affect annual TWC in Nanjing, this thesis referred to relevant data, reviewed papers and literature, consulted experts, and finally selected 16 related independent variables. The results obtained by the three correlation analysis methods are basically the same: the factors differ in both the degree and the effect of their influence on the dependent variable. Therefore, to reduce the problem of inaccurate predictions when the dimension of the nonparametric model is large, this thesis eliminates the variables with low and insignificant influence and retains the remaining variables with significant influence for further study.
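A correlation-based screening step of this kind might look like the following sketch. The text does not name the three correlation methods, so the choice of Pearson, Spearman and Kendall coefficients here is an assumption, as are the function names and the cut-off `threshold=0.5`; the thesis screens on significance, which this simplified version does not test.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation (fails on a constant series)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(v):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as 0."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def screen(features, target, threshold=0.5):
    """Keep features whose |correlation| with target reaches the (hypothetical)
    threshold under all three measures."""
    kept = []
    for name, values in features.items():
        coefs = [f(values, target) for f in (pearson, spearman, kendall)]
        if all(abs(c) >= threshold for c in coefs):
            kept.append(name)
    return kept
```

Here `features` would map each variable abbreviation (for example `"GRP"`) to its annual series, and `target` would be the TWC series.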

Nonparametric Regression
Since the factors affecting urban water consumption are complex and varied, and the degree, manner and effect of each variable's influence differ, ordinary parametric models are prone to large "specification error" in fitting and in later prediction. For example, if the true population is not normal, or is far from normal, statistical inference based on a model that assumes normality may be badly biased. In other words, a parametric model depends heavily on the model specification and may not be robust. This thesis therefore applies nonparametric regression to the research object and compares the results with those of stepwise regression.
Let y be the explained variable, a random variable, and let x be a k-dimensional explanatory vector representing a number of important factors; x may be either deterministic or random. Now consider the following nonparametric multivariate regression model:
$$y_i = m(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{2}$$
where $m(\cdot)$ is an unknown multivariate function and $\varepsilon_i$ is a random error term, which reflects the influence on the explained variable of observable or unobservable factors other than the explanatory variables, as well as the specification error of the model.
The difficulty (and advantage) in estimating model (2) is that $m(\cdot)$ is an unknown function (even its functional form is unknown). Because of this, the model can better reflect the real regression relationship. The idea of nonparametric regression is to estimate $m(x_i)$ separately for each $i$ ($i = 1, 2, \ldots, n$), and thereby obtain an estimate of the regression function $m(x)$. In a sense, we are not looking for the analytical solution of $m(x)$ but for its numerical solution.
The kernel regression estimator of the multivariate nonparametric model at $x_0$ can be written as:
$$\hat{m}(x_0) = \frac{\sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h}\right) y_i}{\sum_{i=1}^{n} K\left(\frac{x_i - x_0}{h}\right)}$$
where $K(\cdot)$ is a k-dimensional kernel function and $h$ is the bandwidth. The properties of multivariate kernel regression estimators are similar to those of univariate kernel regression. However, the optimal bandwidth is $h^* = O(n^{-1/(k+4)})$ (greater than the optimal bandwidth in the univariate case), and $\hat{m}(x_0)$ converges to the true value more slowly. The more explanatory variables there are, the slower the convergence and the larger the required sample size. This is why correlation analysis is used to eliminate variables with little influence before the multivariate nonparametric regression. To improve the nonparametric regression results, this thesis selects the bandwidth by minimizing the improved AIC criterion.
It can be seen from Table 5 that R² is 0.9974, which shows that the model fits the data well.
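The kernel estimator and the bandwidth choice can be sketched in a few lines of Python. Several things here are assumptions and not details from this thesis: a Gaussian product kernel, one bandwidth per explanatory variable, a search over a small candidate grid, and the reading of "improved AIC" as the corrected AIC of Hurvich, Simonoff and Tsai, log(σ̂²) + (1 + tr(H)/n) / (1 − (tr(H) + 2)/n), where H is the smoother matrix. All function names are hypothetical.

```python
from math import exp, log

def kernel_weights(X, x0, h):
    """Normalised Gaussian product-kernel weights of the sample points X at x0."""
    raw = [exp(-0.5 * sum(((a - b) / hj) ** 2 for a, b, hj in zip(xi, x0, h)))
           for xi in X]
    total = sum(raw)  # zero only if every weight underflows (bandwidth far too small)
    return [w / total for w in raw]

def nw_predict(X, y, x0, h):
    """Kernel regression estimate at x0: a kernel-weighted average of the observed y."""
    return sum(w * yi for w, yi in zip(kernel_weights(X, x0, h), y))

def aicc(X, y, h):
    """Corrected AIC of the kernel smoother with bandwidths h (assumed criterion)."""
    n = len(y)
    H = [kernel_weights(X, xi, h) for xi in X]  # row i holds the weights at x_i
    fitted = [sum(w * yj for w, yj in zip(row, y)) for row in H]
    sigma2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / n
    tr = sum(H[i][i] for i in range(n))  # trace of the smoother matrix
    return log(sigma2) + (1 + tr / n) / (1 - (tr + 2) / n)

def select_bandwidth(X, y, grid):
    """Return the common bandwidth from `grid` with the smallest aicc value."""
    k = len(X[0])
    return min(grid, key=lambda c: aicc(X, y, [c] * k))
```

With only 15 annual observations, tr(H) + 2 can approach n for small bandwidths, which makes the penalty term unstable, so the candidate grid should avoid very small values. In practice the explanatory variables are also standardised first so that a common bandwidth is meaningful.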

Fitting Effect Comparison
The fitting results are as follows. The fitted data in Table 6 show that the fitted values of the nonparametric regression model are closer to the actual values than those of the stepwise regression model, which indicates that the nonparametric model handles the complicated and changeable problem of urban water consumption prediction better than the multivariate stepwise regression model. For further comparison, the mean absolute error (MAE) and the mean square error (MSE) of the two models are calculated as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
where $y_i$ is the actual value and $\hat{y}_i$ is the fitted value. It can be seen from Table 7 that the MAE and MSE of the nonparametric estimation are much smaller than those of stepwise regression, so the nonparametric estimation fits better than stepwise regression.
The stepwise regression model and the nonparametric regression model are used for out-of-sample prediction of actual water consumption in 2020; the actual and predicted values are shown in Table 8. The results show that the prediction of the nonparametric regression model is better than that of the stepwise regression model.
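The two error measures can be computed as in the short sketch below, which assumes the standard definitions (the mean of the absolute errors and the mean of the squared errors); the function names are illustrative.

```python
def mae(actual, predicted):
    """Mean absolute error between actual and fitted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean square error between actual and fitted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

Applying both functions to the fitted series of each model, against the same actual TWC series, gives the pairwise comparison described here. MSE penalises large individual errors more heavily than MAE.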

Conclusions and Policy Suggestions
Based on the statistical data of Nanjing, Jiangsu Province, from 2005 to 2019, this thesis analyzes the correlation between annual urban water consumption and its long-term influencing factors, selects the variables with an obvious influence on total water consumption, fits the data with stepwise regression and nonparametric regression to show the advantages and disadvantages of each model, and uses both models for prediction. The following conclusions are drawn. Empirical results show that the nonparametric model fits annual water consumption better than the traditional multivariate stepwise linear regression model.
In prediction, kernel estimation also performs well. Therefore, for long-term prediction, the more accurate kernel estimation method can be used to predict annual urban water consumption, which can provide support for long-term stable urban development, national water conservancy projects and water resources scheduling, so as to reduce economic costs and to allocate and use water resources rationally.

Table 1. Model summary

Table 2. Variance analysis

Table 2 is the variance analysis of each model. Since this thesis chooses model 4 as the best of the stepwise regression models, the variance test results of model 4 are analyzed. The F statistic of this model is 107.820 with Sig. = 0.000 < 0.001, which means that the regression equation passes the significance test (F test) and that the established linear regression model is statistically significant. Table 3 is the regression coefficient test of model 4. The constant term of this model is 35,541.626, and the coefficients of the independent variables are GRP = 2.587, IWC = 0.940, ASTC = 0.204 and DC = 0.163. The t tests show that all Sig. values pass at the 5% significance level. Therefore, this model is the best regression model obtained by stepwise screening, and the following regression equation can be established:
$$\mathrm{TWC} = 35541.626 + 2.587\,\mathrm{GRP} + 0.940\,\mathrm{IWC} + 0.204\,\mathrm{ASTC} + 0.163\,\mathrm{DC} \tag{1}$$
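As a minimal sketch, the constant and coefficients reported for model 4 can be wrapped in a small Python function for evaluation. The function and argument names are illustrative, and the inputs must be in the units used in the data source (GRP in 100 million yuan, IWC and DC in 10,000 cubic meters, ASTC in 10,000 tons).

```python
# Constant and coefficients of model 4 as reported in the text.
INTERCEPT = 35541.626
COEFFICIENTS = {"GRP": 2.587, "IWC": 0.940, "ASTC": 0.204, "DC": 0.163}

def predict_twc(grp, iwc, astc, dc):
    """Predicted total water consumption (10,000 cubic meters) from model 4."""
    return (INTERCEPT
            + COEFFICIENTS["GRP"] * grp
            + COEFFICIENTS["IWC"] * iwc
            + COEFFICIENTS["ASTC"] * astc
            + COEFFICIENTS["DC"] * dc)
```

Because the model is linear, each coefficient is the change in predicted TWC for a one-unit change in that variable with the others held fixed.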

Table 3. Regression coefficient test

Table 4. Correlation coefficients between independent variables and dependent variables

Table 5. Nonparametric regression results

Table 6. Comparison of fitting results

Table 7. Comparison of mean absolute error and mean square error

Table 8. Out-of-sample predicted values