ARIMA and Multiple Linear Regression Additive Model for SO2 Monitoring Data’s Calibration

SO2 is one of the main air pollutants produced by industrial waste gas, civil combustion and automobile exhaust. Real-time monitoring of the SO2 concentration makes it possible to track air quality in time and to take corresponding measures against the pollution sources. Monitoring data may be affected by both internal and external factors. ARIMA was used to model the internal factor, denoted A. Meteorological factors were taken as the external factors, and the difference in SO2 between the standard data and the monitoring data was taken as the dependent variable of a multiple linear regression, denoted B. The time series calibration model was obtained as Y=A+B. The error analysis showed that the accuracy of the SO2 data was improved, so the additive model could effectively calibrate SO2 monitoring data.


Introduction
Environmental pollution is becoming more serious with the development of industrialization and urbanization, and air pollution poses a major threat to human health. The 19th National Congress of the Communist Party of China emphasized that "Green water and green mountains are golden and silver mountains". In recent years, air quality in China has generally improved, but the situation is still grim; for example, the treatment of acid rain remains very difficult. SO2 is one of the main pollutants in the air and the main precursor of acid rain. It is also listed as a Group 3 agent in the carcinogen classification published by the International Agency for Research on Cancer (IARC) of the World Health Organization [1]. Industrial waste gas, civil combustion, automobile exhaust and other sources produce large SO2 emissions. Real-time and accurate monitoring helps to control pollution effectively [2].
However, due to economic and other constraints, the national monitoring stations often cannot meet the requirements of fixed-point, real-time, accurate and economical monitoring. The self-developed micro air monitor therefore has great market value because it is flexible and economical. SO2 monitoring is based on ultraviolet fluorescence: ultraviolet light passed through filters excites SO2 molecules, which emit fluorescence when they decay back to the ground state. The fluorescence intensity is proportional to the SO2 concentration, so the SO2 concentration in the air can be calculated by measuring the fluorescence intensity after amplification by a photomultiplier tube [3]. The micro air monitor places certain requirements on environmental conditions. Its calibration is usually completed under standard conditions such as constant temperature and humidity. In actual use in the natural environment, changes in temperature, humidity, wind speed, pressure, precipitation and other meteorological factors affect the accuracy of the monitoring data [2]. Therefore, the monitoring data collected in the real environment need to be analysed and calibrated.
The data came from the 2019 mathematical modeling competition for college students. It included SO2 monitoring data from NCD and SDD, together with five meteorological factors: wind, pressure, precipitation, temperature and humidity. The SO2 data were found to behave as a time series, so an ARIMA model could be used to describe the dependence of each value on earlier values of the same series, and a multiple linear regression model could be used to describe the influence of the meteorological factors.
The paper is structured as follows. Part 2 is the exploratory analysis of the SDD and NCD monitoring data, including statistical description and hypothesis testing. Part 3 is the difference analysis between the two groups, including correlation analysis and autocorrelation analysis. Part 4 presents the time series calibration model based on ARIMA and multiple linear regression. Part 5 is the error analysis, in which the relative errors are computed and analysed. Part 6 is the conclusion.

Exploratory analysis
In this part, we used statistical description and hypothesis testing on data intercepted from the same time period to preliminarily explore the difference in SO2 monitoring data between NCD and SDD, as well as the relationship between SO2 and other possible influencing factors.

Statistical description
We computed the basic statistics of SO2 monitoring data by NCD (n=4130) and SDD (n=4200) in the same period.
Taking the day as the unit, we calculated the daily mean of the SO2 monitoring data to explore the seasonal variation. The daily mean of SO2 fluctuated (Figure 1): it was higher in autumn and winter and lower in spring and summer. The difference between the maximum and minimum values was large in November, December and January, with a high peak in January; in the other months the values stayed at a relatively low level with small fluctuations. Taking the hour as the unit, we calculated the mean value at each whole hour to explore the variation within a day. The hourly mean of SO2 showed a double-peak pattern (Figure 2). It was high at 6:00 and 7:00, declined slowly to its lowest values at 13:00 and 14:00, rose slowly to its highest values at 19:00 and 20:00, and then declined again. The NCD monitoring data were slightly higher than those of SDD.

Figure 2
The mean value per hour of SO2 by NCD and SDD
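
The daily and hourly means described above can be sketched as a simple grouping step. This is only an illustration: the function name group_means and the readings below are hypothetical, and the original computation may have been done with different tools.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def group_means(records, key):
    """Average the SO2 readings whose timestamps map to the same key."""
    groups = defaultdict(list)
    for ts, value in records:
        groups[key(ts)].append(value)
    return {k: mean(v) for k, v in sorted(groups.items())}

# Hypothetical readings (timestamp, SO2 concentration); the values are made up.
records = [
    (datetime(2019, 1, 1, 6, 0), 20.0),
    (datetime(2019, 1, 1, 6, 30), 22.0),
    (datetime(2019, 1, 1, 13, 0), 8.0),
    (datetime(2019, 1, 2, 6, 0), 18.0),
    (datetime(2019, 1, 2, 13, 0), 10.0),
]

# One mean per calendar day, used to look at the seasonal trend.
daily_mean = group_means(records, key=lambda ts: ts.date())
# One mean per hour of day (0 to 23), used to look at the within-day profile.
hourly_mean = group_means(records, key=lambda ts: ts.hour)
```

Grouping by ts.hour pools the same hour across all days, which is what a within-day profile such as Figure 2 needs.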

Hypothesis testing
A paired t-test was applied to the SO2 data of NCD and SDD. Because the SDD data were not always recorded exactly on the hour, we derived the on-the-hour SDD values in two ways (Table 1).
(1) The reading nearest to the whole hour was taken as the on-the-hour monitoring value.
(2) The mean of the readings within half an hour before and after the whole hour was taken as the on-the-hour monitoring value.
The paired t-test showed significant differences between the two groups (P < 0.05).
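
The two alignment methods and the paired comparison can be sketched as follows. The function names and all readings are hypothetical, and the sketch stops at the t statistic and its degrees of freedom; the P value would come from a t distribution, for example via scipy.stats.ttest_rel.

```python
from datetime import datetime, timedelta
from math import sqrt
from statistics import mean, stdev

def nearest_value(readings, hour_ts):
    """Method (1): the reading closest in time to the whole hour."""
    return min(readings, key=lambda r: abs(r[0] - hour_ts))[1]

def window_mean(readings, hour_ts, half_width=timedelta(minutes=30)):
    """Method (2): mean of the readings within half an hour of the whole hour.
    Raises statistics.StatisticsError if the window contains no reading."""
    return mean(v for ts, v in readings if abs(ts - hour_ts) <= half_width)

def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two equal-length samples."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1

# Hypothetical data: irregular SDD readings and on-the-hour NCD values.
hours = [datetime(2019, 1, 1, h) for h in (6, 7, 8)]
sdd = [
    (datetime(2019, 1, 1, 5, 50), 10.0),
    (datetime(2019, 1, 1, 6, 4), 12.0),
    (datetime(2019, 1, 1, 6, 58), 14.0),
    (datetime(2019, 1, 1, 7, 20), 16.0),
    (datetime(2019, 1, 1, 8, 1), 11.0),
]
ncd = [13.0, 17.0, 15.0]

sdd_nearest = [nearest_value(sdd, h) for h in hours]
sdd_window = [window_mean(sdd, h) for h in hours]
t_stat, df = paired_t(ncd, sdd_nearest)
```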

Difference Analysis
In this part, we studied the correlation of SO2 between NCD and SDD, and the correlations between SO2 and the five meteorological factors. Then, we studied the autocorrelation of SO2.

Correlation analysis
Correlation analysis showed that the SO2 data of NCD and SDD were correlated (r=0.38392, P<0.0001). The SO2 of SDD was also correlated with the meteorological factors (Table 2, P<0.0001): negatively with wind, precipitation and temperature, and positively with pressure and humidity.
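
As a small illustration of how such a correlation coefficient is obtained, the sketch below computes Pearson's r directly from its definition. It assumes that the reported r is a Pearson coefficient, which the text does not state; the function name and the sample values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up values: SO2 readings and wind speed at the same time points.
so2 = [12.0, 15.0, 9.0, 20.0, 14.0]
wind = [3.0, 2.5, 4.0, 1.0, 2.8]
r_wind = pearson_r(so2, wind)  # negative here: SO2 is lower when wind is stronger
```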

Autocorrelation analysis
Autocorrelation and partial autocorrelation analysis showed that the autocorrelation of SO2 was highly significant (Figure 3, P<0.0001). Therefore, the SO2 monitoring data were treated as time series data, and time series analysis methods were used in the following study.
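
A minimal sketch of the sample autocorrelation function is given below. The function name and the readings are hypothetical, and the 1.96/sqrt(n) band is the usual large-sample approximation for white noise rather than the test used in the original analysis.

```python
from math import sqrt

def acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag (biased estimator)."""
    n = len(series)
    m = sum(series) / n
    c0 = sum((v - m) ** 2 for v in series)
    return [
        sum((series[t] - m) * (series[t - k] - m) for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]

# Made-up, slowly varying SO2 readings.
so2 = [12.0, 12.5, 13.1, 13.0, 12.2, 11.5, 11.0, 11.4, 12.1, 12.9, 13.3, 13.0]
r = acf(so2, 3)
# Approximate 95% band under the white-noise assumption.
bound = 1.96 / sqrt(len(so2))
significant_lags = [k for k in range(1, len(r)) if abs(r[k]) > bound]
```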

Time series calibration model
In this part, the data from NCD were taken as the standard data, and the SO2 of SDD was remodeled in combination with the meteorological factors. We divided the variation of the dependent variable (Y) into two parts: the internal factor (A), caused by the autocorrelation of the series, and the external factor (B), caused by the meteorological factors. The two parts were additive.
The SO2 of SDD was treated as time series data, so A was the predicted value of the SO2 of SDD based on ARIMA. For the external meteorological factors, the difference between NCD and SDD (NCD-SDD) was the dependent variable and the meteorological factors were the independent variables (COL1~COL5, i.e., wind, pressure, precipitation, temperature, humidity); B was modeled by multiple linear regression on these variables.
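
In symbols, the decomposition described above can be written as below. This is a sketch of the notation only: the time index t, the hats, the coefficients beta and the error term e_t are assumed here and do not come from the original, while COL1 to COL5 are the five meteorological factors named in the text.

```latex
\begin{aligned}
Y_t &= A_t + B_t, \\
A_t &= \widehat{\mathrm{SDD}}_t
      && \text{(ARIMA prediction of the SDD series)}, \\
\mathrm{NCD}_t - \mathrm{SDD}_t
    &= \beta_0 + \sum_{i=1}^{5} \beta_i\, \mathrm{COL}_{i,t} + e_t
      && \text{(regression fitted on the modeling samples)}, \\
B_t &= \hat{\beta}_0 + \sum_{i=1}^{5} \hat{\beta}_i\, \mathrm{COL}_{i,t}
      && \text{(predicted correction)}.
\end{aligned}
```

Under this reading, Y_t approximates the standard value NCD_t because A_t stands in for SDD_t and B_t stands in for NCD_t - SDD_t.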

A based on ARIMA
The ARIMA model is a well-known time series model proposed by Box and Jenkins. It builds on the following three forms [4], written here in standard textbook notation, where $X_t$ is the observation at time $t$, $\varepsilon_t$ is white noise, $c$ and $\mu$ are constants, and $\varphi_i$ and $\theta_j$ are the model coefficients.

AR (auto-regressive): $X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t$
MA (moving-average): $X_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$
ARMA: $X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$

Since the sampling interval of the SDD monitoring data was irregular and the lowest common multiple of the intervals was huge, ignoring the irregular spacing could have biased the model. To prevent this, we defined an observation point every five minutes, starting on the hour, and took the mean of the values within each five-minute interval as the observation at that point. In total, 2000 time points were obtained as modeling samples, covering about one week of continuous data. The model parameters were estimated by the maximum likelihood method [4].
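
The five-minute averaging described above can be sketched as follows. The bin boundaries (each bin covers five minutes starting at a multiple of five minutes after the start time) and the handling of empty bins (they are skipped) are assumptions, since the text does not specify them; the function name and the readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

def five_minute_means(readings, start, step=timedelta(minutes=5)):
    """Average irregular readings into consecutive bins of width `step`.

    Bin i covers [start + i*step, start + (i+1)*step). Bins with no
    reading are skipped (assumption), so gaps stay visible in the output.
    """
    bins = defaultdict(list)
    for ts, value in readings:
        if ts >= start:
            index = (ts - start) // step  # timedelta // timedelta gives an int
            bins[index].append(value)
    return [(start + i * step, mean(v)) for i, v in sorted(bins.items())]

# Made-up irregular SDD readings.
start = datetime(2019, 1, 1, 0, 0)
readings = [
    (datetime(2019, 1, 1, 0, 0, 30), 10.0),
    (datetime(2019, 1, 1, 0, 3, 10), 12.0),
    (datetime(2019, 1, 1, 0, 6, 0), 9.0),
    (datetime(2019, 1, 1, 0, 14, 59), 15.0),
]
series = five_minute_means(readings, start)
```

With this spacing, 2000 points span 10000 minutes, which is about 6.9 days and matches the one week of data mentioned in the text.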
The ACF and PACF of SO2 showed that the series was basically stationary after first-order differencing, so the difference order was set to d=1. Comparing BIC values, the minimum BIC (-1.2471) was obtained at p=0 and q=1. ARIMA(0, 1, 1) was therefore used to predict the SO2 of SDD. The parameter estimates are shown in Table 3 and the model prediction in Figure 4.
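
To illustrate what an ARIMA(0, 1, 1) fit involves, the sketch below differences the series once and fits the single MA coefficient. It deliberately swaps in a simpler technique than the one used in the paper: the coefficient is found by a grid search on the conditional sum of squares, not by maximum likelihood, no constant term is included, and the order is fixed instead of being selected by BIC. The function names and the simulated series are hypothetical.

```python
import random

def css(diffs, theta):
    """Conditional sum of squares of an MA(1) on the differenced series."""
    e_prev, total = 0.0, 0.0
    for d in diffs:
        e = d - theta * e_prev  # innovation implied by d_t = e_t + theta*e_{t-1}
        total += e * e
        e_prev = e
    return total

def fit_arima011(y):
    """Grid-search theta in (-1, 1) and return the value with the smallest CSS."""
    diffs = [b - a for a, b in zip(y, y[1:])]
    grid = [k / 100 for k in range(-99, 100)]
    return min(grid, key=lambda th: css(diffs, th))

def one_step_forecast(y, theta):
    """Next-value forecast: last observation plus theta times last innovation."""
    e = 0.0
    for a, b in zip(y, y[1:]):
        e = (b - a) - theta * e
    return y[-1] + theta * e

# Simulated series with a known coefficient (not the competition data).
random.seed(0)
true_theta = -0.6
eps = [random.gauss(0.0, 1.0) for _ in range(2001)]
y = [50.0]
for t in range(1, 2001):
    y.append(y[-1] + eps[t] + true_theta * eps[t - 1])

theta_hat = fit_arima011(y)
next_value = one_step_forecast(y, theta_hat)
```

For real use, a maintained implementation such as the ARIMA class in statsmodels would provide maximum likelihood estimates and the BIC directly.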

B based on multiple linear regression
The parameter estimates of the multiple linear regression are shown in Table 4, and the ANOVA results in Table 5 (R² = 0.932). The fitted MLR function expresses B as a linear combination of COL1 to COL5 with the coefficients listed in Table 4.
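
A minimal sketch of fitting such a regression by ordinary least squares is shown below, using only the standard library so that every step is visible. The function names, the predictor values and the coefficients are all hypothetical; they are not the estimates of Table 4, and the target is generated without noise so that the recovered coefficients can be checked.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_mlr(rows, target):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(X, target)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Regression prediction b0 + sum(b_i * x_i) for one observation."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

# Made-up, pre-scaled predictors: wind, pressure, precipitation, temperature, humidity.
rows = [
    (1.0, 2.0, 0.0, 5.0, 3.0), (2.0, 1.0, 1.0, 3.0, 4.0),
    (0.5, 3.0, 0.0, 8.0, 2.0), (3.0, 0.5, 2.0, 1.0, 5.0),
    (1.5, 2.5, 0.5, 6.0, 1.0), (2.5, 1.5, 0.0, 2.0, 6.0),
    (0.0, 4.0, 1.5, 7.0, 2.5), (4.0, 0.0, 0.5, 4.0, 3.5),
]
true_coefs = [0.5, -1.0, 2.0, -0.5, 0.3, 0.8]      # hypothetical values
target = [predict(true_coefs, r) for r in rows]     # stands in for NCD - SDD
coefs = fit_mlr(rows, target)
```

The normal equations are adequate for a handful of well-scaled predictors; with raw pressure values in the thousands, a QR-based solver such as numpy.linalg.lstsq would be numerically safer.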

Discussion
In this part, we mainly focused on the prediction validity of the model. After removing the samples for the modeling, the remaining samples were used to test the prediction precision. We compared the predictive values (PV) and the standard values (SV), and calculated the average relative error to evaluate the calibration effects.
We obtained the predictive values from the additive calibration model and from the ARIMA model alone, and also compared the raw monitoring data of SDD against the standard values. The average relative errors are given in Table 6. The average relative error of ARIMA was the highest, and that of the additive calibration model was the lowest, so the accuracy was improved.
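
The evaluation metric can be sketched as below. The exact formula is not given in the text, so this assumes the average relative error is the mean of |PV - SV| / SV over the test samples; the function name and the numbers are hypothetical and are not the values of Table 6.

```python
def average_relative_error(predicted, standard):
    """Mean of |PV - SV| / |SV| over all test points (SV must be non-zero)."""
    if len(predicted) != len(standard):
        raise ValueError("PV and SV must have the same length")
    errors = [abs(p - s) / abs(s) for p, s in zip(predicted, standard)]
    return sum(errors) / len(errors)

# Made-up standard values (SV) and predictive values (PV).
sv = [10.0, 20.0, 40.0]
pv = [11.0, 18.0, 40.0]
are = average_relative_error(pv, sv)  # (0.1 + 0.1 + 0.0) / 3
```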

Conclusion
The exploratory analysis of the SO2 monitoring data showed that the observed variable had a temporal structure and was autocorrelated. The correlation analysis also showed correlations between the observed variable and other influencing factors. The paper therefore proposed that SO2 monitoring data might be affected by internal and external factors. ARIMA was used for the internal factor, A. Meteorological factors were taken as the external factors, and the difference in SO2 between the standard data and the monitoring data was taken as the dependent variable of a multiple linear regression, B. The time series calibration model was obtained as Y=A+B. The error analysis verified the prediction precision and validity of the additive calibration model.
Our model still has shortcomings. First, due to the lack of data, we calibrated only from the data point of view; there was no quantitative analysis of physical factors such as the zero drift and range drift of an electrochemical gas sensor in long-term use [6]. Second, interaction terms were not considered in the multiple linear regression. These are the directions in which the model should be improved in the future.