Data Driven Air Quality Prediction based on Mobile Measurement

The temporal and spatial prediction of nitrogen dioxide (NO2) is essential because of its harmful impact on the environment. Forecasts would help, for example, to regulate traffic flow predictively. Traditionally, air quality measurements are performed at fixed locations or with dedicated mobile laboratories. In this work, we installed measurement technology in a vehicle and connected it to the vehicle's measuring system in order to evaluate further parameters. We selected one route profile and continuously measured the NO2 concentration in real-time traffic, driving this route profile several times in succession. The rationale of this approach is the idea that several vehicles are equipped with the same measurement technology and drive on the same route profile within the same time. The contribution of this work is to forecast the NO2 concentration for a given route profile under constant weather conditions based on mobile measurements. To this end, we divided the recorded data into training and test data and investigated five different approaches for forecasting the NO2 concentration on the respective route profile. Among other aspects, we used cross-validation methods to assess the prediction quality. Results show that sliding-window approaches that average previous rounds are most suitable for predicting the NO2 concentration. Furthermore, our data reveal that the prediction quality improves when the test data immediately follow the training data.


Introduction
Nitrogen dioxide (NO2) is one of the pollutants that are harmful to the environment and public health. Its main sources include industry and traffic [1,2]. Usually, official traffic air monitoring stations located next to the roadside measure the air quality. The hourly limit value for the NO2 concentration in European cities, which may be exceeded 18 times a year, is 200 µg/m³, and the annual mean limit value is 40 µg/m³ [3]. Cities therefore aim to improve air quality in order to protect human health and avoid financial sanctions. In addition to classical air quality studies based on fixed locations or mobile laboratories [4,5], there are scientific studies based on mobile measurements. For instance, Elen et al. and Liu et al. describe vehicle-based mobile approaches for measuring fine-grained air quality and other pollutants in real time [6,7]. Furthermore, bicycles have already been used to measure ultrafine particles, black carbon, and carbon monoxide [8,9]. The advantages of this measurement method over stationary measuring stations are its proximity to potential emitters in road traffic, the dynamics of the measurement, and greater coverage.
In addition to measuring the current state, predicting air quality is also of particular importance. Prediction models help to provide early warnings to the population and to take action before tolerance limits are exceeded. In this respect, scientific approaches exist for both temporal and spatial forecasts. In [10,11,12,13], for example, the NO2 concentration was predicted on the basis of stationary measurement data using various machine learning methods. Alternatively, different dispersion models based on mathematical or statistical approaches have been applied in order to investigate the spatial forecasting of pollutants [14,15].
In order to extend prediction models that are based on stationary measuring stations or on models, we equipped a vehicle with NO2 measuring technology. The measuring device was connected to the measuring system of the vehicle so that we also recorded specific vehicle parameters from internal (CAN) messages. We then selected a route profile that also includes a stationary measuring station next to the roadside. Subsequently, we drove this route profile 23 times and continuously measured the NO2 concentration in real-time traffic. After dividing the measured data into training and test data, we investigated the prediction quality of different techniques, for example, using cross-validation. The contributions of this work are:
• We investigate different techniques in order to forecast the NO2 concentration with maximum accuracy for the selected route.
• We propose a lightweight, mobile measuring system that includes both environmental and car-specific parameters and thus allows for more precise data collection and a higher accuracy in the prediction phase.
• We provide a comprehensive comparison of the prediction techniques used, based on real-world data.

Background
In this section, we describe the selected measurement technology and measurement methods in more detail. Furthermore, we explain the driving route profile. Subsequently, the method of cross-validation is introduced to assess the forecasting approaches.

Measurement Technology
We installed the measuring device "NO2/NO/NOx Monitor Model 405 nm" from "2B Technologies, Inc." on the vehicle. The measurement principle is a direct measurement of NO2 by absorbance at 405 nm in the concentration range 0-10,000 ppb with an accuracy of 2 ppb.
This device is designated as a Federal Equivalent Method (FEM) for NO2 compliance monitoring in the United States but not in Europe [16]. Therefore, the measuring method does not comply with the official European directives and is not an officially licensed measuring technique. In comparison, the measurement method of roadside measuring stations according to the 22nd Ordinance to the Federal Immission Control Act (BImSchV) for the NO2 concentration is based on the chemiluminescence method [3,17]. However, both methods have already been compared with each other, with the conclusion that the data quality of our measuring device is sufficient [18].

Route Profile
We used the measuring technique described in the previous section in order to measure the NO2 concentration on a route profile. This route is 680 m long and can be understood as a circuit driven with right turns. Fig. 1 shows schematically the start and end point of a round as well as the locations of the traffic lights and the air measuring station. We drove the route profile a total of 23 times in succession in order to make a statement about how the NO2 concentration changes on the route for following vehicles. The rationale of this procedure is the idea that several vehicles are equipped with the same measurement technology and drive on the same route profile within the same time. External conditions regarding weather and time, which have an impact on the NO2 concentration [4,5], can be seen in Table 1.

Table 1. External weather conditions, which we assumed to be constant.

Cross-Validation
Having presented the route profile on which the NO2 concentration was measured, we now explain one of the methods used in this work in more detail.
Cross-validation methods are test methods of data analysis that are used, among other fields, in data mining where the goal is prediction. There are several types of this method, such as simple, stratified, or leave-one-out cross-validation [19,20]. In this work, the process of simple cross-validation is described in more detail and then applied.
In k-fold cross-validation, the forecast results are evaluated by partitioning the original data set of N elements into k equally sized subsets T1, …, Tk (each of length m). Furthermore, a distinction is made between a training set and a test set. In total, k passes are performed; in the i-th pass, the subset Ti is used as the test set and the remaining k-1 subsets are used as the training set. The total error rate is calculated as the average of the individual error rates of the k passes [19,20,21].
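The k passes described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work; the names k_fold_splits, cross_validate, fit, and error are hypothetical, and folds may differ in size by one element when N is not divisible by k.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of the k passes."""
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 1 < k <= n_items")
    indices = list(range(n_items))
    # Spread the items as evenly as possible: the first (n_items % k)
    # folds receive one extra element.
    base, extra = divmod(n_items, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(data, k, fit, error):
    """Average the error rates of the k individual passes.

    fit(train_items) returns a model; error(model, test_items) returns
    the error rate of that model on the held-out items.
    """
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(data), k):
        model = fit([data[j] for j in train_idx])
        fold_errors.append(error(model, [data[j] for j in test_idx]))
    return sum(fold_errors) / k
```

With k equal to the number of elements, every fold contains exactly one element, which corresponds to leave-one-out cross-validation.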

Forecasting Methodology
In this section, we describe the data preparation in more detail. Subsequently, we explain the forecasting approaches used in this paper and their application. Finally, we present an estimation of the quality of each forecasting approach.

Data Preparation
As described in the previous section, we connected the measurement device to the vehicle's measuring system. The channels stored on the measuring system have different transmission frequencies (asynchronous) because of different priorities on the CAN. This means, for example, that the value for the driving speed is transmitted more frequently per time unit than the value for the NO2 concentration. Consequently, all recorded signals must be resampled onto a common equidistant grid in order to be able to deduce the distance covered from the driving speed.
In this work, we chose a linear interpolation based on the time vector with an increment of 0.2 s. In order to infer distance-equidistant signals from time-equidistant ones, non-equidistant distance values were first derived from the velocity signal. Subsequently, we used linear interpolation (based on the distance) with an increment of 1 m in order to calculate signals on an equidistant distance axis. Afterwards, we created a 23 x 680 matrix. The number of rows corresponds to the number of rounds and the number of columns to the NO2 concentration values recorded or interpolated per 1 m.
In addition, we have taken into account the delay time and the reaction time of our measurement device by shifting the measurement data of the NO2 concentration by a certain time interval. This is particularly important for determining the location of the measured NO2 concentration.
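The preparation steps above (time-based interpolation, distance derivation from the speed, distance-based interpolation, and the delay shift) could look roughly as follows for a single round. This is only a sketch under stated assumptions: the function names, the trapezoidal integration of the speed, the clamping at the signal ends, and the delay_s parameter are illustrative choices and are not taken from the measurement setup; speeds are assumed to be non-negative and given in m/s.

```python
from bisect import bisect_right


def interp_linear(x_new, x, y):
    """Linearly interpolate the samples (x, y) at the points x_new.

    x must be non-decreasing; queries outside [x[0], x[-1]] are clamped
    to the first or last sample.
    """
    out = []
    for xq in x_new:
        if xq <= x[0]:
            out.append(y[0])
        elif xq >= x[-1]:
            out.append(y[-1])
        else:
            j = bisect_right(x, xq)  # guarantees x[j - 1] <= xq < x[j]
            w = (xq - x[j - 1]) / (x[j] - x[j - 1])
            out.append(y[j - 1] + w * (y[j] - y[j - 1]))
    return out


def resample_round(t_speed, speed, t_no2, no2, dt=0.2, dx=1.0, delay_s=0.0):
    """Return the NO2 values of one round on a distance grid with step dx (m)."""
    # Attribute each NO2 value to the moment the air was sampled
    # (delay_s is an assumed, device-specific delay in seconds).
    t_no2 = [t - delay_s for t in t_no2]
    # Step 1: bring both signals onto a common time grid with increment dt (s).
    t_start = max(t_speed[0], t_no2[0])
    t_end = min(t_speed[-1], t_no2[-1])
    n_t = int((t_end - t_start) / dt + 1e-9)
    t_grid = [t_start + i * dt for i in range(n_t + 1)]
    v = interp_linear(t_grid, t_speed, speed)
    c = interp_linear(t_grid, t_no2, no2)
    # Step 2: integrate the speed (trapezoidal rule) to get the distance covered.
    dist = [0.0]
    for i in range(1, len(t_grid)):
        dist.append(dist[-1] + 0.5 * (v[i - 1] + v[i]) * dt)
    # Step 3: interpolate the NO2 signal onto a distance grid with increment dx.
    n_x = int(dist[-1] / dx + 1e-9)
    x_grid = [i * dx for i in range(n_x + 1)]
    return interp_linear(x_grid, dist, c)
```

Applying such a function to every round and stacking the results row by row would yield a matrix with one row per round and one column per metre of the route.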

Forecasting Approaches
In order to predict the NO2 concentration along the route described in Section 2.2, we used different approaches to determine the NO2 concentration for the subsequent round. Depending on the approach, we took the individual rows (with all columns) as test or training data sets. Fig. 2 shows the different approaches and, in particular, how they differ in the training data used.
Approach 1 describes the forecasting of the following round based on the previous round. We assumed that the next round will be the same as the previous round. We considered one round as training data and the directly following round as test data. These data were then compared with one another in order to make a statement about the forecasting quality.
For approaches 2 & 3, we calculated the mean over the complete distance of previous rounds. For approach 2, we did not keep the number of previous rounds constant. Rather, the number of rounds increased with each iterative step, and we calculated the mean value over these rounds (always starting with round 1). Finally, we compared the mean (dashed box over the previous rounds) with the round directly after (i.e., the round to be predicted).
Approach 3 differs from approach 2 in the number of rounds over which we calculated the mean value. Here, we calculated the mean over the last three, four, five, and six rounds directly before the round to be predicted (sliding window). These data were then compared with one another.
In approach 4, we assumed that the NO2 concentration of each round can be forecasted by calculating the mean value of all driven rounds except the round to be predicted. Here in particular, we used the method of cross-validation from Section 2.3. The round to be predicted can be seen as test data and the mean value of the remaining rounds as training data.
In the last approach, we also applied the method of cross-validation. In contrast to the first approach, we compared not only consecutive rounds but every single round with all other rounds. For example, the first round was taken as training data and the remaining rounds as test data. These data were then compared with one another in order to make a statement about the forecasting quality.
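As a rough illustration of approaches 1 to 3, the following Python sketch builds the prediction for round i from a list of rounds, where each round is a list of NO2 values per metre. The function names and the 0-based round index i are assumptions made for this example only.

```python
def column_mean(rounds):
    """Element-wise mean over several rounds (one value per metre)."""
    n = len(rounds)
    return [sum(col) / n for col in zip(*rounds)]


def predict_previous_round(rounds, i):
    """Approach 1: the round to be predicted equals the previous round."""
    if i < 1:
        raise ValueError("at least one previous round is required")
    return list(rounds[i - 1])


def predict_expanding_mean(rounds, i):
    """Approach 2: mean over all previous rounds, always starting with the first."""
    if i < 1:
        raise ValueError("at least one previous round is required")
    return column_mean(rounds[:i])


def predict_sliding_mean(rounds, i, window):
    """Approach 3: mean over the last `window` rounds directly before round i."""
    if not 1 <= window <= i:
        raise ValueError("window must be between 1 and the number of previous rounds")
    return column_mean(rounds[i - window:i])
```

In this formulation, approach 1 is the special case of a sliding window of size one, and approach 2 is the special case in which the window always covers all previous rounds.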
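The two cross-validation-based approaches can be sketched in the same way. The helper names are hypothetical, and treating the last approach as the set of all ordered pairs of distinct rounds is our reading of the description above, chosen because an error measure that is normalized by the training data is not symmetric.

```python
def leave_one_out_means(rounds):
    """Approach 4: for each round i, yield (i, mean of all other rounds)."""
    n = len(rounds)
    if n < 2:
        raise ValueError("at least two rounds are required")
    totals = [sum(col) for col in zip(*rounds)]
    for i, held_out in enumerate(rounds):
        # Remove the held-out round from the column totals before averaging.
        yield i, [(tot - x) / (n - 1) for tot, x in zip(totals, held_out)]


def pairwise_round_pairs(n_rounds):
    """Approach 5: all ordered (training round, test round) pairs of distinct rounds."""
    return [(i, j) for i in range(n_rounds) for j in range(n_rounds) if i != j]
```

For 23 rounds, the pairwise scheme produces 23 · 22 = 506 comparisons, whereas the leave-one-out mean produces 23.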

Forecasting Quality
In the following, we discuss the forecasting quality of all approaches presented in the previous subsection in more detail. To this end, we calculated the percentage error between the training and test data set for each approach, using the Root Mean Squared Percentage Error (RMSPE) as a measure of the forecasting quality. It indicates how well test data is adapted to existing training data, or how much a forecast deviates on average from the historical data (i.e., actual observed values). The larger the RMSPE, the greater the deviation from the model. In the literature, the RMSPE is used in many prognosis error evaluations, for example in regression-based and statistical methods [22,23]. The RMSPE is calculated as follows [23]:
$\mathrm{RMSPE} = 100\,\% \cdot \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( \frac{T_j - P_j}{T_j} \right)^2}$
Here, $T_j$ and $P_j$ denote the j-th entry of the training and test data set, respectively, and n is the number of entries. That is, the difference between the training and test data set is calculated for each column entry (the NO2 concentration per 1 m on the route profile) and set in relation to the training data set per entry. Afterwards, the ratio is squared and set in relation to the total data size. Finally, the root is calculated. Using this approach, we calculated the prediction error of a single (predicted) round against other rounds.
Afterwards, we use box-plot diagrams to assess which approach in Section 3.2 achieves the lowest (mean) prediction error. Box-plot diagrams are suitable for representing the distribution and dispersion of the RMSPE. To this end, the mean (+) and median (dashed line) of the error of each approach as well as the 25th and 75th percentiles and the minimum and maximum deviation are displayed in Fig. 3.
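The calculation described above translates directly into a short Python function. This is a minimal sketch: the function name rmspe and the explicit check for zero-valued training entries (which would make the percentage error undefined) are additions for this example.

```python
import math


def rmspe(train, test):
    """Root mean squared percentage error of `test` relative to `train`, in percent."""
    if len(train) != len(test) or not train:
        raise ValueError("train and test must be non-empty and of equal length")
    if any(t == 0 for t in train):
        raise ValueError("training values must be non-zero")
    # Relative deviation per entry, squared and averaged over all entries.
    squared_ratios = [((t - p) / t) ** 2 for t, p in zip(train, test)]
    return 100.0 * math.sqrt(sum(squared_ratios) / len(squared_ratios))
```

For example, a training round of [10, 20] and a test round of [11, 18] deviate by 10 % in both entries, so the function returns an RMSPE of about 10 %.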
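The quantities shown in such a box plot can be computed with the Python standard library, as in the following sketch. The function name and the choice of the "inclusive" quantile method are assumptions; other percentile definitions give slightly different quartiles for small samples.

```python
import statistics


def boxplot_summary(errors):
    """Return the statistics displayed in the box plot for a list of RMSPE values."""
    # quantiles(..., n=4) returns the three cut points: 25th, 50th, 75th percentile.
    q25, median, q75 = statistics.quantiles(errors, n=4, method="inclusive")
    return {
        "min": min(errors),
        "q25": q25,
        "median": median,
        "mean": statistics.fmean(errors),
        "q75": q75,
        "max": max(errors),
    }
```

Comparing the mean with the median in this summary indicates how strongly a few very large errors dominate an approach.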

Results and Discussion
In this section, we present the results for our five approaches and discuss them afterwards. As an example, we describe the results and the prediction quality for the first approach in more detail. To this end, we calculated the RMSPE between the (i+1)-th round (test data) and the i-th round (training data). For this approach, we assumed that the next round corresponds to the previous round. We show the errors related to this approach in Table 2. The left column indicates the two rounds (observed & predicted) used for prediction. The right column specifies the respective RMSPE. The specified error in percent indicates how much the NO2 concentration of the following round deviates on average from the previous (actual observed) round.
The table shows that the RMSPE varies widely from round to round. This depends, among other aspects, on the amount of traffic per round. It also matters with which intensity and in which form a vehicle ahead emits NO2, which is then drawn in by the measuring device (exhaust gas plume). We have not classified these values as outliers in our data set. It can therefore be assumed that the RMSPE is only low if the traffic volume is identical to that of the previous round under constant weather conditions.
Consequently, this value does not allow a reliable statement about whether the NO2 concentration for the next round will be higher or lower. Since presenting all tables would exceed the page limit, we visualize the RMSPE values for all five approaches by means of box-plot diagrams in Fig. 4. Because the third approach is based on a sliding window, we took the mean values of the NO2 concentration of the previous three, four, five, and six rounds as training data. These were then compared with the NO2 concentration of the following round. Subsequently, the RMSPE was calculated for each window size separately. Because the largest maximum deviation differs considerably from the smallest one, we have divided the display into two areas with different scaling. The bottom range has an interval of 50 % ending at 200 % and the upper range has an interval of 1000 % starting at 500 %.
First of all, our data reveal that approaches 1, 4 & 5 have the greatest dispersion. Considering the maximum errors, these approaches reach an RMSPE of more than 500 %. The minimum, median, and mean errors of these three approaches are also larger than for approaches 2 & 3. Similarly, the NO2 concentration of individual rounds differs considerably. Hence, we conclude that predicting other rounds based on one round (including cross-validation) yields worse results than the other two approaches. Based on our evaluation, we argue that approaches 1, 4 & 5 are, on average, unsuitable for forecasting subsequent rounds under the assumption of unchanged weather parameters.
The dispersion of the second and third approach is considerably smaller. Hence, prediction with these methods is suitable on average. However, our data reveal that a forecast with at least 50 % accuracy requires a history of more than one round. The second method seems to produce a lower RMSPE, but a sliding-window method leads to a similar prediction quality.
In the fourth approach, a mean is formed analogously to approaches 2 and 3. However, the crucial difference is the correlation between the training data and the test data directly following the training data. This shows that the mean value computation based on previous rounds only leads to a low RMSPE if the test data follow the training data directly.

Conclusions and Future Work
In this paper, we first described how we measured the NO2 concentration. Subsequently, we investigated different approaches from the field of data mining with which the NO2 concentration of the following rounds can be forecasted. We used both simple approaches and methods of cross-validation. In order to show the error between the training and test data, we first calculated the RMSPE and then visualized its distribution and dispersion in a box-plot diagram.
We have shown that approaches without previous averaging of the NO2 concentration over certain rounds generate a higher RMSPE than approaches with averaging. We have also shown that averaging alone is not sufficient for a significantly low RMSPE. Rather, the training data set must be recorded directly before the test data set. Each round is subject to different traffic conditions and, among other aspects, exhaust gas plumes, so that approaches without averaging are unsuitable.
The current work can be expanded by equipping several vehicles with suitable NO2 measuring technology in order to make a comprehensive statement about the quantity and quality of the measurements. Based on this, predictions for route sections can then be derived using the methods described above. Furthermore, the NO2 concentration should also be measured under different weather conditions and at different times (especially rush hour) on a selected route in order to take into account the influences of these parameters.

Fig. 1. Route profile, which was driven 23 times and includes three traffic lights and a traffic air monitoring station.

Fig. 2. Different approaches to forecast the next round. The dashed lines indicate the rounds over which the mean value is calculated. Approaches 2 and 3 are similar and differ only in the size of the training data set (sliding window). Cross-validation was used especially for the last two approaches. Legend: T = training data; P = test data (predicted data); each round contains 680 elements.

Fig. 3. General structure of the box-plot diagram.

Table 2. RMSPE for the first approach, described in Fig. 2.