Development of a statistical forecasting model for PM 2.5 in Macau based on clustering of backward trajectories

A daily PM 2.5 forecasting model based on multiple linear regression (MLR) and HYSPLIT backward trajectory clustering was designed for application to small cities where the PM 2.5 level is easily affected by regional transport. The objective of this study is to investigate the regions that affect the fine particulate concentration of Macau and to develop an effective forecasting system that better captures PM 2.5 episodes. By clustering the HYSPLIT 24-h backward trajectories originating at Macau from 2015 to 2017, five potential transport paths of PM 2.5 were found. A cluster based statistical model was developed and trained with air quality and meteorological data of 2015 and 2016. The trained model was then evaluated with data of 2017. Compared with an ordinary model without backward trajectory clustering, the cluster based PM 2.5 forecasting model yielded similar general forecast performance in 2017. However, the critical success index of the cluster based model was 11% higher than that of the ordinary model. This means that the cluster based model is better at capturing high PM 2.5 concentrations, which is what matters most for protecting public health.


Introduction
Due to its serious adverse effects on the respiratory and cardiovascular systems, PM 2.5 has become a major concern of the public and the Chinese government. In the past decade, China has experienced severe PM 2.5 pollution due to rapid urban development. Many cities do not comply with the annual PM 2.5 standard. These nonattainment cities are mostly distributed in three major city clusters: Beijing-Tianjin-Hebei (BTH), the Yangtze River Delta (YRD) and the Pearl River Delta (PRD) [1]. Macau, a gaming and tourism city located in the southern part of the PRD, is in a similar situation. In order to protect its citizens, it is important to develop a daily forecasting model of PM 2.5 for Macau.
As of the third quarter of 2018, the city accommodated a population of 663,400 in an area of 30.8 km². Due to the small geographical area, a statistical model can be a more attractive alternative to large scale deterministic models when developing an operational air quality model [2]. The air quality of Macau is not only governed by local emissions and dispersion conditions. Its PM 2.5 level is also susceptible to transboundary pollution from neighbouring regions. Therefore, the forecasting model needs to reflect the regional influence of fine particulates. This in turn requires an understanding of the distribution of Macau's upwind cities, where the fine particulates are generated before being transported by the atmospheric flow.
In view of this, the objective of this study is to develop a cluster based PM 2.5 statistical forecasting model for performing 1 day ahead forecasts of the daily averaged PM 2.5 concentration in Macau. Compared with deterministic models, statistical approaches that draw on a large quantity of historical measurements under a variety of conditions often achieve higher accuracy [3]. To achieve this objective, the historical backward trajectories originating at Macau between 2015 and 2017 are analysed and clustered into several groups by K-means clustering in HYSPLIT. Then, a specific statistical model is developed based on current day local meteorological factors, local time-lagged air pollutant concentrations, and the hourly PM 2.5 concentrations at midnight from several upwind cities identified in the classified trajectory clusters. The cluster based PM 2.5 forecasting model is then trained using the data of 2015 and 2016, and the dataset of 2017 is used for evaluation of the model. Details of the methodology are described in the following section.
The air quality dataset consists of hourly concentrations of PM 2.5 , PM 10 , SO 2 and NO 2 measured at 5 monitoring stations (http://www.smg.gov.mo/smg/airQuality/c_air_stations.htm) from 2015 to 2017. The meteorological datasets consist of measured data (2015-2016) and forecasted data (2017) at the headquarters of the Macau Meteorological and Geophysical Bureau (22°09′36″N, 113°33′54″E). The meteorological dataset contains 5 parameters: wind speed, wind direction, mean sea level pressure, temperature and relative humidity.
Due to the small area of Macau, its PM 2.5 level is not only influenced by local pollution sources but is also easily affected by regional transport from its neighbouring upwind cities. Therefore, it is reasonable to adopt the PM 2.5 concentrations of upwind cities as model inputs. The upwind PM 2.5 concentrations were obtained from the real-time air quality release platform of the China National Environmental Monitoring Centre (CNEMC). The website reports hourly concentrations of PM 2.5 measured at 367 cities, with a time lag of 1 hour. However, one difficulty that immediately arises is the choice of the upwind cities, which is addressed in the following subsection.

Clustering analysis of 24-h backward trajectories
Previous works showed that the upwind pollutant concentration of the previous day could influence the performance of air quality forecasting models [4][5][6], meaning that the choice of upwind cities is important for model development. In this study, the choices for Macau were investigated systematically by using 24-h backward trajectories with the same starting height of 500 m for the period between 2015 and 2017. These trajectories were calculated using the HYSPLIT trajectory model of NOAA. Each trajectory was set to start from Macau at midnight, and the 24-h trajectory end point represented the estimated location of the upwind polluted air at midnight on the previous day. All the 24-h backward trajectories for each year between 2015 and 2017 were then classified by K-means into 5 clusters, as shown in Fig. 1. Each column in Fig. 1 shows the clusters of trajectories for a particular year. The general pattern is revealed more clearly by the mean trajectory of each cluster, as shown in Fig. 2. The clusters of different years were similar to each other, meaning that this pattern is repeatable over time. The pattern indicates that the regional influence of PM 2.5 in Macau could be due to (i) the transport of inland fine particulates from the northern part of the PRD (north and north-west clusters, occurring mostly during winter), (ii) the transport of fine particulates from coastal cities between the Bashi Channel and the Taiwan Strait (northeast cluster, occurring mostly during winter and spring), and (iii) the transport of marine aerosols from the South China Sea (southern clusters, occurring mostly during summer).
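The grouping step can be illustrated with a minimal sketch. It does not call HYSPLIT or reproduce its clustering tool: it applies a plain K-means to flattened hourly latitude/longitude points of synthetic trajectories. The start coordinates, the bearings, the step size and all function names are assumptions made only for illustration.

```python
import math
import random

MACAU = (22.16, 113.56)  # approximate lat/lon of the start point (assumption)

def synthetic_trajectory(bearing_deg, step_deg, rng, hours=24):
    """Hypothetical stand-in for one 24-h back trajectory (hourly lat/lon points)."""
    lat, lon = MACAU
    theta = math.radians(bearing_deg)
    points = [(lat, lon)]
    for _ in range(hours):
        lat += step_deg * math.cos(theta) + rng.gauss(0.0, 0.02)
        lon += step_deg * math.sin(theta) + rng.gauss(0.0, 0.02)
        points.append((lat, lon))
    return points

def flatten(traj):
    """Turn [(lat0, lon0), (lat1, lon1), ...] into one feature vector."""
    return [value for point in traj for value in point]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, max_iter=100, seed=0):
    """Plain Lloyd's K-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable, so centers are too
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

rng = random.Random(42)
bearings = [0, 45, 315, 150, 200]  # hypothetical inflow directions, in degrees
trajectories = [synthetic_trajectory(b, 0.1, rng)
                for b in bearings for _ in range(20)]
features = [flatten(t) for t in trajectories]
labels, centers = kmeans(features, k=5)

# Mean trajectory of each cluster, back in (lat, lon) form.
mean_trajs = [list(zip(c[0::2], c[1::2])) for c in centers]
for c, mt in enumerate(mean_trajs):
    print(c, labels.count(c), mt[-1])  # cluster id, size, 24-h end point
```

The 24-h end point of each mean trajectory is the kind of information that could guide the choice of upwind cities. With real data, the synthetic generator would be replaced by trajectory end points exported from HYSPLIT.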

Cluster based forecasting model of PM 2.5
Multiple linear regression (MLR) has been shown to be an effective tool for air quality forecasting [7]. Here, MLR was adopted to develop the cluster based forecasting model of PM 2.5 . The general form of the forecasting model is:

$$ y_k = \beta_0 + \boldsymbol{\beta}^{T} \mathbf{x}_{k-1} + \varepsilon_k $$

where $y_k$ represents the measured PM 2.5 concentration of the k th day, which is equal to the spatial average of the available daily averaged PM 2.5 concentrations measured at the 5 monitoring stations. It is predicted by a linear combination of the input variables with an intercept term $\beta_0$, while $\boldsymbol{\beta}$ represents the vector of coefficients of the linear combination. $\mathbf{x}_{k-1}$ is the vector of input variables available on the (k-1) th day, and $\varepsilon_k$ is the modelling error associated with the forecasting model.
Table 1 shows the composition of the input variables adopted to develop the forecasting model. In Table 1, the symbols RH, PSEA, TEMP, WSPD and U represent the daily averages of the relative humidity, mean sea level pressure, temperature, wind speed and the north-south component of the wind direction, respectively. These variables were used to reflect the atmospheric stability, the wind available for dilution and the nature (inland or sea) of the trajectory. Besides the meteorological variables, the local daily averaged pollutant concentrations of the (k-1) th day and the hourly PM 2.5 concentration at 11:00 PM of the (k-1) th day in Macau were also incorporated to reflect the initial air quality condition of the k th day. Finally, the hourly PM 2.5 concentrations at 11:00 PM of the (k-1) th day for several upwind cities were selected and used to reflect the regional influence of fine particulates.
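As an illustration of the regression step, the following sketch fits an MLR with an intercept by ordinary least squares, once with and once without an upwind-city input. The data are synthetic, and the variable names, the coefficient values and the reduced set of three inputs are assumptions for illustration only. They are not the inputs of Table 1 or the fitted coefficients of this study.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_mlr(X, y):
    """Least-squares fit via the normal equations; returns [intercept, b1, b2, ...]."""
    Xa = [[1.0] + list(row) for row in X]  # prepend the intercept column
    p = len(Xa[0])
    XtX = [[sum(r[i] * r[j] for r in Xa) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(Xa, y)) for i in range(p)]
    return solve_linear(XtX, Xty)

def predict(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def sse(beta, X, y):
    """Sum of squared errors of the fitted model on (X, y)."""
    return sum((predict(beta, x) - yi) ** 2 for x, yi in zip(X, y))

# Synthetic training set (all values hypothetical).
rng = random.Random(0)
rows, y = [], []
for _ in range(200):
    local_prev = rng.uniform(5, 60)    # local PM2.5 of day k-1
    wspd = rng.uniform(0.5, 8)         # daily mean wind speed
    upwind_prev = rng.uniform(5, 80)   # upwind-city PM2.5 at 11 PM of day k-1
    rows.append([local_prev, wspd, upwind_prev])
    y.append(4.0 + 0.5 * local_prev - 1.0 * wspd
             + 0.3 * upwind_prev + rng.gauss(0.0, 2.0))

ordinary_rows = [r[:2] for r in rows]        # same inputs without the upwind city
beta_cluster = fit_mlr(rows, y)              # with the upwind-city input
beta_ordinary = fit_mlr(ordinary_rows, y)    # without it

print("cluster based:", [round(b, 3) for b in beta_cluster])
print("training SSE :", sse(beta_cluster, rows, y), "vs",
      sse(beta_ordinary, ordinary_rows, y))
```

Because the ordinary model is nested in the model with upwind inputs, its training error can never be lower. A fair comparison therefore needs a held-out period, such as the 2017 evaluation used in this study.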

Results and discussion
The model was trained with air quality and meteorological data of 2015 and 2016. Although the model involves meteorological inputs of the k th day, which are not available in practice on the (k-1) th day, the measured meteorological data of the k th day were used as model inputs during training in order to obtain better training results. When the trained models were evaluated against the data of 2017, the meteorological forecasts of the k th day were used instead of the measurements. The model accuracy was evaluated according to the measures shown in Eqns. (4) to (9). In these equations, $P$, $O$ and $\bar{O}$ represent the prediction, the observation, and the annual average of the observed PM 2.5 concentrations, respectively. For the overall accuracy, the coefficient of determination (R²) describes the proportion of the variance in the observations that is explained by the model. The normalized mean absolute error (NMAE) describes the overall magnitude of the forecast errors relative to the mean of the observations. The index of agreement (IA) is a dimensionless indicator between zero and one. IA = 1 means perfect prediction, while IA = 0 means no agreement at all [4]. This index can detect additive and proportional differences between the observations and the predictions. As for the prediction performance during high PM 2.5 concentrations, the episode detection rate (EDR) represents the probability of a successful hit given an observed exceedance (PM 2.5 > 35 μg/m³), and the false alarm rate (FAR) describes the proportion of false alarms among the forecasted exceedances. In this study, the hit or alarm threshold was 35 μg/m³. This value corresponds to the daily limit of the primary PM 2.5 standard of US-NAAQS, which is equivalent to the grade I PM 2.5 standard of China (GB3095-2012). The critical success index (CSI) is an integrated measure of EDR and FAR. It can be treated as a discounted EDR in which the penalty depends on the FAR of the model.
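A minimal sketch of how such measures can be computed is given below. It uses commonly used definitions: R² as the squared Pearson correlation, Willmott's form of the index of agreement, and contingency-table forms of EDR, FAR and CSI. These are assumptions and may differ in detail from Eqns. (4) to (9). The function name and the sample values are hypothetical.

```python
def evaluate(pred, obs, threshold=35.0):
    """Forecast scores from paired predictions and observations (ug/m3)."""
    n = len(obs)
    o_mean = sum(obs) / n
    p_mean = sum(pred) / n

    def ratio(num, den):
        # Undefined scores (e.g. no exceedances at all) are returned as NaN.
        return num / den if den else float("nan")

    # General accuracy (assumed definitions).
    cov = sum((p - p_mean) * (o - o_mean) for p, o in zip(pred, obs))
    var_p = sum((p - p_mean) ** 2 for p in pred)
    var_o = sum((o - o_mean) ** 2 for o in obs)
    r2 = ratio(cov ** 2, var_p * var_o)
    nmae = ratio(sum(abs(p - o) for p, o in zip(pred, obs)), n * o_mean)
    ia = 1.0 - ratio(
        sum((p - o) ** 2 for p, o in zip(pred, obs)),
        sum((abs(p - o_mean) + abs(o - o_mean)) ** 2 for p, o in zip(pred, obs)),
    )

    # Episode performance from the 2x2 contingency table.
    hits = sum(1 for p, o in zip(pred, obs) if p > threshold and o > threshold)
    misses = sum(1 for p, o in zip(pred, obs) if p <= threshold and o > threshold)
    false_alarms = sum(1 for p, o in zip(pred, obs) if p > threshold and o <= threshold)

    return {
        "R2": r2,
        "NMAE": nmae,
        "IA": ia,
        "EDR": ratio(hits, hits + misses),
        "FAR": ratio(false_alarms, hits + false_alarms),
        "CSI": ratio(hits, hits + misses + false_alarms),
    }

# Hypothetical daily values, for illustration only.
obs = [20.0, 40.0, 50.0, 30.0, 36.0, 10.0]
pred = [25.0, 38.0, 30.0, 37.0, 40.0, 12.0]
scores = evaluate(pred, obs)
for name, value in scores.items():
    print(name, round(value, 3))
```

In this toy series there are two hits, one miss and one false alarm, so EDR = 2/3, FAR = 1/3 and CSI = 0.5. Under these definitions 1/CSI = 1/EDR + 1/(1 - FAR) - 1, so for a fixed EDR a higher FAR lowers the CSI.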
Table 2 shows the performance of the cluster based PM 2.5 forecasting model and the ordinary PM 2.5 forecasting model during the training period and the evaluation period. The ordinary PM 2.5 forecasting model is a subset of the cluster based model that excludes the hourly PM 2.5 concentrations of the upwind cities. During the training phase, the cluster based model outperforms the ordinary model because its extra model complexity allows a closer fit to the training data. During the evaluation phase, the ordinary model and the cluster based model achieve similar general performance (R², NMAE, and IA). However, the critical success index (CSI) of the cluster based model is 11% higher than that of the ordinary model, meaning that the inclusion of upwind cities is important. It is noted that the predictions by both the cluster based model and the ordinary model are generally in good agreement with the measurements. However, the cluster based model is better able to capture the peaks. As the primary objective of an air quality forecast model is to correctly predict the episodes, the development of the cluster based model in this study is successful.

Conclusions
An empirical PM 2.5 forecasting model was developed for Macau based on multiple linear regression and HYSPLIT backward trajectory clustering. The backward trajectory clustering revealed 3 potential regional sources of PM 2.5 for Macau. A cluster based statistical forecasting model was developed, and the regional influence of PM 2.5 was reflected in the model by using the hourly PM 2.5 concentrations of the upwind cities shortly before midnight. Two years of data (2015-2016)