A knowledge-guided and manual intervention-based gene expression programming for PM2.5 concentration prediction

Deep learning algorithms for PM2.5 concentration prediction lack interpretability and cannot reveal the formation mechanism of PM2.5 concentration. To address this, this paper adopts a knowledge-guided and manual intervention-based gene expression programming (KMGEP) algorithm. The KMGEP algorithm not only has strong model learning ability but can also obtain an explicit functional relationship between PM2.5 concentration and its influencing factors. Knowledge guidance and manual intervention are introduced into GEP to improve its global optimization ability and convergence speed. Taking daily PM2.5 concentration prediction in winter (December to February) in the Xi'an region as an example, the KMGEP algorithm is compared with the back-propagation artificial neural network (BP-ANN) and the combined convolutional neural network and long short-term memory network model (CNN-LSTM). Experimental results show that the KMGEP algorithm achieves high prediction accuracy and that the obtained function expression reveals the relationship between PM2.5 concentration and its influencing factors.


Introduction
Air pollution is a serious threat to human health. In particular, fine particulate matter (PM2.5), with a diameter of less than 2.5 μm, can lodge deep in the respiratory tract because of its small size and can affect blood circulation by penetrating lung cells, thereby harming human health [1,2]. Studies have shown that an increase in PM2.5 concentration raises the risk of obstructive airway diseases, chronic bronchitis, asthma, lung cancer and various cardiovascular diseases [3]. Therefore, accurately predicting PM2.5 concentration and identifying the functional relationship between PM2.5 concentration and its influencing factors have become key research problems for scholars at home and abroad.
In recent years, with the development of artificial intelligence, many scholars have used machine learning algorithms to predict PM2.5 concentration. Chen Cheng et al. used a multi-instance genetic neural network to predict indoor PM2.5 concentration, and the results were better than those of linear regression, support vector machines, random forests and other methods [4]. Zheng Guowei et al. established a combined support vector machine and wavelet neural network prediction model based on the nonlinear and time-varying characteristics of PM2.5 concentration, and its predictions were better than those of a single support vector machine model [5]. Chen Qiang et al. used the back-propagation artificial neural network (BP-ANN) and a multiple linear regression model to predict the PM2.5 concentration in Zhengzhou, and the results showed that BP-ANN was the more effective of the two [6]. Sun Yibo et al. proposed a PM2.5 concentration prediction model based on a deep neural network (DNN), and the results showed that the DNN could greatly improve prediction accuracy using only aerosol optical depth data and meteorological observation data [7]. Zhao Wenfang et al. established a prediction model based on deep learning, and the results showed that the model could effectively improve the prediction accuracy of PM2.5 concentration over the next 24 hours [8]. Li Taoying et al. proposed a combined convolutional neural network and long short-term memory network model (CNN-LSTM), and compared univariate and multivariate CNN-LSTM models, showing that the multivariate model predicted better [9]. Liu Xulin et al. proposed a deep learning prediction model based on convolutional neural networks and sequence-to-sequence learning, and the results showed that the model could effectively improve the prediction accuracy of PM2.5 concentration for the next hour and generalized well [10].
In summary, deep learning has become a popular approach to PM2.5 concentration prediction, but it cannot produce an explicit function expression. Gene expression programming (GEP) is a machine learning algorithm proposed by Ferreira on the basis of the genetic algorithm and genetic programming, and is inspired by the open reading frame in genetics [11]. GEP not only has model learning capabilities as powerful as those of deep learning algorithms, but can also obtain explicit functional relations. The GEP algorithm has been successfully applied to the predictive modelling of software reliability [12], dew point [13], energy loss [14], and concrete mechanical properties [15].
In this paper, the KMGEP algorithm is used to solve the problem of short-term PM2.5 concentration prediction. Based on previous research on PM2.5 concentration prediction, the influencing factors and related functions of PM2.5 concentration are designed according to the requirements of the GEP algorithm, so as to build an a priori knowledge base for it. On top of natural evolution, manual intervention is added to enhance the efficiency of the algorithm and improve the quality of the solution. This paper takes the daily prediction of PM2.5 concentration in Xi'an in winter as a case study, and obtains the functional relationship between PM2.5 concentration and its influencing factors, which can reveal the internal mechanism linking them. The degree of fit (R^2), mean absolute error (MAE) and root mean square error (RMSE) are used as evaluation indicators of model prediction. The effectiveness of the KMGEP prediction model is demonstrated by comparison with the back-propagation artificial neural network (BP-ANN) of [6] and the combined convolutional neural network and long short-term memory network model (CNN-LSTM) of [9].

Gene expression programming algorithm
The gene expression programming (GEP) algorithm combines the coding form of the genetic algorithm (GA) with the tree structure of genetic programming (GP). Its advantage is that, relying on its powerful search and evolution ability, it can find the mathematical expression that best fits the training data without knowing the internal mechanism of the problem, using the training data alone. The flow chart of the GEP algorithm is shown in Fig. 1.
The steps are as follows:
Step 1: [Initialization] Set the algorithm parameters and generate the initial population.
Step 2: [Termination condition] Evaluate the fitness of each individual in the population and determine whether the maximum fitness value or the maximum number of iterations has been reached. If so, output the final result; otherwise, go to the next step.
Step 3: [Mutation] Select individuals at random according to the mutation probability, and apply the mutation operation to randomly selected gene positions of each selected individual.
Step 4: [IS, RIS and gene transposition] Select individuals at random according to the transposition probability and apply the corresponding transposition operation.
Step 5: [One-point, two-point and gene recombination] Select individuals at random according to the recombination probability and apply the corresponding recombination operation.
Step 6: [Selection] Use tournament selection to choose the individuals that form the next generation, then go to Step 2.

PM2.5 concentration prediction model based on KMGEP
In the traditional gene expression programming algorithm, the choice of influencing factors is subjective to some extent. In this paper, the results of previous studies on the influencing factors of PM2.5 concentration and the related functions are summarized and organized as the a priori knowledge base of KMGEP. In addition, to address the slow convergence of traditional gene expression programming, the superior individuals in the current population are selected to form an a posteriori knowledge base that guides the evolution direction of the KMGEP algorithm, and manual intervention is introduced to improve its evolutionary efficiency.

Knowledge guidance
Knowledge guidance is realized through the knowledge base, which consists of two parts: an a priori knowledge base and an a posteriori knowledge base.
(1) Establishment of the a priori knowledge base
The a priori knowledge base is obtained from previous studies on PM2.5 concentration, and contains the influencing factors of PM2.5 concentration and related functions. Lu Debin et al. showed that the influencing factors of PM2.5 were meteorological factors (relative humidity, rainfall, wind speed, temperature), pollution sources (sulfur dioxide emissions, smoke and dust emissions), urbanization and industrial structure (population density, GDP per capita, ratio of primary, secondary and tertiary industries to GDP), and corporate pollution control and technology optimization (green area, green coverage, sulfur dioxide removal, soot removal, R&D expenditure as a percentage of GDP) [16]. Wu Zhuang et al. showed that, over a short period, a region's development scale, geographical conditions, industrial pollution emissions and automobile exhaust emissions are relatively fixed, so changes in PM2.5 concentration are mainly related to local meteorological conditions [17]. Lv Baolei et al. and Wu Yuan et al. found that the meteorological factors affecting PM2.5 are wind, temperature, humidity and air pressure [18,19]. Liu Suixin et al. found that atmospheric pollution, wind speed and precipitation were the factors affecting the PM2.5 concentration in Xi'an [20]. Meng Zhaowei et al. found that daily average temperature, daily average pressure, daily average wind speed, daily average relative humidity, precipitation and the lowest temperature were significantly related to the mass concentration of PM2.5 [21]. Peng Yan et al. showed that the influencing factors of PM2.5 concentration were PM10, SO2, NO2, O3, minimum temperature, average relative humidity, maximum wind speed, maximum wind direction, and sunshine time [22]. Qu Chao et al. showed that the influencing factors of PM2.5 concentration were PM10, SO2, CO, O3, relative humidity, temperature and wind speed [23]. Aydin Shishegaran et al. found that GEP could use nonlinear equations such as power functions and trigonometric functions to express the impact of meteorological parameters on air quality [24].

According to the above literature, the main factors affecting PM2.5 concentration in short-term prediction are the concentrations of air pollutants and meteorological factors, and predictive models are highly likely to contain basic functions such as power functions and trigonometric functions. Based on this analysis and the characteristics of short-term prediction in this paper, an a priori knowledge base is constructed, as shown in Table 1.

Table 1. A priori knowledge base
Influencing factors: PM10 (x0), SO2 (x1), NO2 (x2), CO (x3), O3 (x4), minimum temperature (x5), average temperature (x6), relative humidity (x7), average wind speed (x8), air pressure (x9), sunshine duration (x10), precipitation (x11), maximum wind speed (x12), direction of the maximum wind speed (x13)
Function set: +, -, *, /, log (base-10 logarithm), exp (exponential of e), ln (natural logarithm), ~ (base-10 exponential), x^2 (square of x), sqrt (square root), sin (sine), cos (cosine), abs (absolute value)

(2) Establishment of the a posteriori knowledge base
The current population is ranked by fitness, and the top m individuals are selected as the members of the a posteriori knowledge base.

Manual intervention
Manual intervention consists of individual intervention and population intervention.
(1) Individual intervention
Individual intervention aims to improve the quality of the individuals in the population. It consists of two operators: a repairing operator, which removes inferior genes from individuals, and a strengthening operator, which spreads superior genes to the individuals of the population. The specific operations are as follows.
Repairing operator: infeasible solutions in the population contain inferior gene sites, for example sites that make a divisor 0 or make the argument of a square root negative. The gene sites that cause these faults are recorded in the inferior-gene matrix, which changes continually in the course of evolution. Inferior gene sites are identified as follows:
1) Calculate the effective length of each gene in the individual.
2) Traverse the gene positions from right to left, find the one or two operands of each gene position, and apply the function operator at that position to the operands.
3) If a "ZeroDivisionError", "ValueError" or "OverflowError" is raised during the calculation, the gene site is an inferior gene site.
The specific steps of the repairing operator are as follows:
1) Look up the inferior gene sites of the infeasible individuals stored in the inferior-gene matrix.
2) Replace the symbol at each inferior gene site, one site at a time, with a symbol selected at random from the function set F and the terminal set T.
3) Calculate the fitness of the individual after replacement. If the result is a real number, the replacement is successful; otherwise, return to step 2).
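The inferior-site test in step 3) can be illustrated with a minimal Python sketch. It assumes a hypothetical function table `FUNCS` covering only part of the function set, and a helper name `is_inferior_site` that is not taken from the paper. The decoding of a gene into operators and operands is omitted, and only the exception check is shown.

```python
import math

# Hypothetical function table: symbol -> (arity, implementation).
# Only a subset of the function set is listed, for illustration.
FUNCS = {
    "+": (2, lambda a, b: a + b),
    "-": (2, lambda a, b: a - b),
    "*": (2, lambda a, b: a * b),
    "/": (2, lambda a, b: a / b),
    "sqrt": (1, math.sqrt),
    "ln": (1, math.log),
    "exp": (1, math.exp),
}


def is_inferior_site(symbol, operands):
    """Return True if applying `symbol` to `operands` raises one of the
    three errors named in step 3), marking the site as inferior."""
    arity, fn = FUNCS[symbol]
    try:
        fn(*operands[:arity])
    except (ZeroDivisionError, ValueError, OverflowError):
        return True
    return False
```

Under these assumptions, a division by zero, a square root of a negative number and an exponential that overflows are all flagged, while an ordinary addition is not.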
Strengthening operator: the specific steps of the strengthening operator are as follows:
1) For each individual j in the population, generate a random number between 0 and 1. If it is less than the probability of the strengthening operator, individual j is selected for optimization; otherwise, no optimization is performed on it.
2) Randomly select an individual i from the a posteriori knowledge base and a gene fragment [s:t] of it, from position s to position t. Transplant this fragment to positions s to t of the individual j selected in step 1) to form a new individual k, and evaluate the fitness of k. If the fitness of k is greater than that of the original individual j, replace j with k; otherwise, keep j unchanged.
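The two steps above can be sketched in Python as follows. This is an illustrative sketch only: individuals are assumed to be equal-length lists of symbols, `fitness` is any callable where larger is better, and the names `strengthen`, `m` and `p_strengthen` are hypothetical. The posterior knowledge base is rebuilt here as the m fittest individuals of the current population.

```python
import random


def strengthen(population, fitness, m, p_strengthen, rng=random):
    """Transplant a fragment [s:t] from a knowledge-base individual into
    randomly chosen individuals, keeping only improving transplants."""
    # Posterior knowledge base: the m fittest individuals (assumed encoding).
    knowledge_base = sorted(population, key=fitness, reverse=True)[:m]
    result = []
    for individual in population:
        if rng.random() < p_strengthen:
            donor = rng.choice(knowledge_base)
            s = rng.randrange(len(individual))
            t = rng.randrange(s, len(individual))
            # Same positions s..t in donor and recipient, as in step 2).
            candidate = individual[:s] + donor[s:t + 1] + individual[t + 1:]
            if fitness(candidate) > fitness(individual):
                individual = candidate
        result.append(individual)
    return result
```

Because the fragment is copied between identical positions, a symbol that sits in the tail of the donor also lands in the tail of the recipient, which presumably is why the same positions s to t are used.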
(2) Population intervention
Population intervention addresses the premature convergence caused by a lack of population diversity during evolution. It replaces a number of poor individuals with the same number of new individuals: randomly generated feasible individuals, and mirror individuals that differ strongly from the existing ones and are generated by mirror mapping. This increases the diversity of the population and thereby effectively improves the global optimization ability of the algorithm. Population intervention diversifies the candidate functions relating PM2.5 concentration to its influencing factors, so that evolution selects from a more comprehensive function set and the algorithm can express the functional relationship between PM2.5 concentration and its influencing factors more accurately.
This paper uses information entropy as a measure of population diversity, and decides whether the current population needs intervention by comparing the entropy with a threshold. The information entropy is computed as follows:
1) Count the number of occurrences of the i-th function or terminal at gene position j across the population.
2) Calculate the probability that the i-th function or terminal appears at gene position j, that is, the count from step 1) divided by N, where N is the size of the population (Formula 1).
3) Calculate the information entropy H of the population (Formula 2), where L is the total length of an individual and S is the total number of functions and terminals.
If H is greater than or equal to the set threshold, the population remains unchanged; if H is less than the threshold, population intervention is carried out as follows. The population is sorted by fitness in descending order. The b lowest-ranked individuals are replaced by b mirror individuals, and the next c lowest-ranked individuals are replaced by c random individuals, forming a new population.
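As a rough illustration of the entropy computation, the sketch below assumes the standard Shannon form, that is, the negative sum over all gene positions j and symbols i of p_ij ln p_ij, with p_ij equal to the count at position j divided by N. The logarithm base and any normalization used in Formula 2 are assumptions here, and the function name `population_entropy` is hypothetical.

```python
import math
from collections import Counter


def population_entropy(population):
    """Sum, over gene positions, of the Shannon entropy of the symbol
    frequencies observed at that position (natural logarithm assumed)."""
    n = len(population)          # N: population size
    length = len(population[0])  # L: total length of an individual
    h = 0.0
    for j in range(length):
        counts = Counter(individual[j] for individual in population)
        for count in counts.values():
            p = count / n        # probability of a symbol at position j
            h -= p * math.log(p)
    return h
```

With this form, a population of identical individuals has entropy 0, and the value grows as the symbols at each position become more evenly spread, so a low value signals that intervention may be needed.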
Mirror individual: let the function symbol set be F = {+, -, *, /, sqrt, x^2, exp, cos, sin, ln, log, ~}, where ~ denotes the base-10 exponential, and let the mirror function symbol set be mirror_F = {-, +, /, *, x^2, sqrt, ln, sin, cos, exp, ~, log}. Traverse each individual to be mirrored: if the i-th gene is the j-th element of F, the i-th gene of the new individual is the j-th element of mirror_F. A new individual is formed after all gene positions of the individual have been traversed.
Random individual: generated by the same rules as the individuals of the initial population.
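The mirror mapping can be written compactly in Python. In this sketch the token `"x2"` stands for the square function and individuals are assumed to be lists of symbols; both are illustrative choices. Symbols not listed in F, such as terminals, are assumed to be left unchanged.

```python
# Function symbols and their mirrors, in the order given in the text.
F = ["+", "-", "*", "/", "sqrt", "x2", "exp", "cos", "sin", "ln", "log", "~"]
MIRROR_F = ["-", "+", "/", "*", "x2", "sqrt", "ln", "sin", "cos", "exp", "~", "log"]
MIRROR = dict(zip(F, MIRROR_F))


def mirror_individual(individual):
    """Replace each function symbol by its mirror; symbols outside F
    (assumed here to be terminals) pass through unchanged."""
    return [MIRROR.get(gene, gene) for gene in individual]
```

Each function is paired with its inverse or counterpart (for example exp with ln, and sqrt with the square), so applying the mapping twice returns the original individual.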

KMGEP algorithm steps
This paper adds knowledge guidance and manual intervention to GEP to improve the evolution efficiency of the algorithm and the quality of the solution. For selection, tournament selection with an elite strategy is used, which preserves the best individuals of the current population so that the algorithm converges to the global optimum [25]. The KMGEP algorithm steps are as follows:

Comparative experiment
To verify the performance of the algorithm in this paper, a comparative experiment was carried out against the models of [6] and [9], taking winter PM2.5 concentration prediction in Xi'an as an example.

Data setting
The influencing factors of PM2.5 concentration used in this paper are air quality data (PM10, SO2, NO2, CO, O3) and meteorological data (minimum temperature, average temperature, relative humidity, average wind speed, air pressure, sunshine duration, precipitation, maximum wind speed, and the direction of the maximum wind speed). The daily monitoring data of the Xi'an area from 2017 to 2018 (December to February) were obtained from the China National Environmental Monitoring Centre. The collected data are daily averages. 70% of the data is used as the training set and 30% as the test set.

Fitness function selection
In this paper, the degree of fit R^2 = 1 - SSE/SST, known in statistics as the coefficient of determination, is used as the fitness function. SSE is calculated by Formula 3 and SST by Formula 4:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad (3)$$
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad (4)$$
where n is the number of samples, $y_i$ represents the real data, $\hat{y}_i$ represents the predicted data, and $\bar{y}$ represents the average value of the real data. The closer R^2 is to 1, the higher the prediction accuracy.
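A minimal Python sketch of this fitness function is given below. It assumes the usual definitions, with SSE as the sum of squared differences between real and predicted values and SST as the sum of squared deviations of the real values from their mean. The function name `r_squared` is illustrative.

```python
def r_squared(y_true, y_pred):
    """Degree of fit R^2 = 1 - SSE / SST for two equal-length sequences."""
    mean = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - sse / sst
```

A perfect prediction gives 1, while always predicting the mean of the real data gives 0. A candidate expression that fits worse than the mean yields a negative value.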

Performance evaluation
This paper uses three indicators to evaluate the prediction results: the degree of fit (R^2), the root mean square error (RMSE) and the mean absolute error (MAE). RMSE is calculated by Formula 5 and MAE by Formula 6:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \quad (5)$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \quad (6)$$
where n is the length of the prediction set, $y_i$ is the i-th true value in the prediction set, and $\hat{y}_i$ is the i-th predicted value obtained by the model.
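The two error indicators can be computed as in the following sketch, which assumes the standard definitions of RMSE (square root of the mean squared error) and MAE (mean of the absolute errors). The function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error over the prediction set."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error over the prediction set."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
```

Both indicators are in the units of the PM2.5 concentration itself. RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a fuller picture of the error distribution.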

Initial parameter setting
The algorithm parameter settings are shown in Table 2.

Results and Discussion
The expression of the KMGEP prediction model is shown in Formula 7, and the prediction performance of the model is shown in Table 3.
y = x3 + (cos ln x7) * ln x9 + x2 * x2 + (cos ln x0) * ln x10 ln x0 + x6 * x10 (7)
Here, y represents the concentration of PM2.5, x0 represents PM10, x2 represents NO2, x3 represents CO, x6 represents average temperature, x7 represents humidity, x9 represents air pressure and x10 represents sunshine hours. From Formula 7, the functional relationship between PM2.5 concentration and its factors can be seen clearly. Through the evolution of the algorithm, the influencing factors of PM2.5 concentration are reduced from the original 14 (PM10, SO2, NO2, CO, O3, minimum temperature, average temperature, relative humidity, average wind speed, air pressure, sunshine duration, precipitation, maximum wind speed and direction of the maximum wind speed) to 7 (PM10, CO, NO2, humidity, sunshine duration, air pressure and average temperature). The related functions are reduced from the original 13 (+, -, *, /, log, exp, ln, ~, x^2, sqrt, sin, cos and abs) to 5 (exp, cos, ln, + and *). Thus, the PM2.5 concentration in winter in the Xi'an area is mainly related to PM10, CO, NO2, humidity, sunshine duration, air pressure and average temperature. When the concentrations of the air pollutants CO and NO2 increase, the PM2.5 concentration also increases, which is consistent with the research results in the literature. The above analysis shows that exponential, logarithmic and trigonometric functions can better reveal the relationship between PM2.5 concentration and its factors. It also shows that the KMGEP algorithm, through selection, evolution and survival of the fittest, finally obtains the functions and influencing factors related to PM2.5 concentration, and that the obtained function expression can effectively explain the internal mechanism relating PM2.5 concentration to its influencing factors. For the control of winter haze in Xi'an, Formula 7 can be used to identify the variables to control, so as to control PM2.5 concentration effectively.
KMGEP is compared with the BP-ANN and CNN-LSTM models, and the prediction curves are shown in Figures 2-4. In the figures, the horizontal axis shows the sample points, the vertical axis shows the PM2.5 concentration, r_v is the true value, and p_v is the predicted value. The prediction performance indicators are shown in Table 4.
From Figures 2-4, it can be seen that the KMGEP algorithm in this paper outperforms the BP-ANN and CNN-LSTM algorithms. Table 4 shows that, compared with BP-ANN, the degree of fit of KMGEP is 0.03 higher, its mean absolute error is 2.08 lower, and its root mean square error is 3.43 lower. Compared with CNN-LSTM, its degree of fit is 0.02 higher, its mean absolute error is 1.84 lower, and its root mean square error is 2.32 lower. KMGEP is therefore better than the other two models on all three indicators. BP-ANN and CNN-LSTM are neural-network-based models. Their final model is a matrix of parameters, in which the relationship between PM2.5 concentration and its factors is not visible. By contrast, the KMGEP algorithm obtains a specific functional relationship between PM2.5 concentration and its influencing factors, which can reveal the mechanism of PM2.5 concentration. In addition, the influencing factors used by CNN-LSTM and BP-ANN are determined mainly by correlation analysis, which mainly captures linear relationships and is insufficient for analyzing nonlinear relationships. These models also make insufficient use of previous research results on PM2.5 concentration, so the influencing factors fed into the model are incomplete. The KMGEP algorithm takes its influencing factors from the a priori knowledge base built on previous results, so its determination of influencing factors is more comprehensive, and the experiment shows that its accuracy is higher.
In summary, the above analysis proves the competitiveness and advancement of the KMGEP algorithm in PM2.5 concentration prediction.

Summary
PM2.5 concentration prediction is of great significance to the prevention and control of haze. This paper presents a knowledge-guided and manual intervention-based gene expression programming algorithm (KMGEP), which introduces knowledge guidance and manual intervention on the basis of GEP. The KMGEP algorithm establishes an a priori knowledge base of the influencing factors and related functions of PM2.5 concentration by mining previous research results on PM2.5 concentration prediction. An a posteriori knowledge base is established by preserving high-quality genes in the evolving population, and through the guidance of this knowledge the algorithm evolves more accurately and effectively. The KMGEP algorithm then maintains the diversity of the population by introducing manual intervention, and finally finds the optimal solution of the problem. In this paper, the proposed algorithm was applied to PM2.5 concentration prediction in Xi'an and compared with the neural-network-based algorithms in the literature. The results show that the KMGEP algorithm not only effectively improves the accuracy of PM2.5 concentration prediction, but also makes the relationship between PM2.5 concentration and its influencing factors clearly visible, which provides an important reference for haze control in China. In the future, the KMGEP algorithm can also be applied to other air mass concentration studies and other intelligent prediction fields.