Short-term Predictions of PM10 Using Bayesian Regression Models

One of the air pollutants that poses the greatest threat to human health is PM10. The objective of this study is to develop prediction models for PM10 concentration. Multiple Linear Regression (MLR) and Bayesian Regression Models (BRM) were constructed to forecast the PM10 concentration for the following day (Day 1) and the next two days (Day 2). To choose the best model, performance indicators (NAE, RMSE, PA, IA, and R²) were applied to each model. Jerantut, Nilai, Shah Alam, and Klang were chosen as monitoring sites. Five years of data from the Department of Environment Malaysia (DOE) were used as a case study, with seven parameters (PM10, temperature, relative humidity, NO2, SO2, CO, and O3). According to the findings, the key factors responsible for the unhealthy levels of air quality at the Klang station include carbon monoxide (CO), nitrogen dioxide (NO2), sulphur dioxide (SO2), and ozone (O3) from industrial and maritime activities, which are thought to influence PM10 concentrations in the area. Compared with the MLR models, the BRM performed best in predicting the next-day and next-two-day PM10 concentrations at all locations.


Introduction
PM is an abbreviation for "particulate matter," which refers to all of the solid and liquid particles suspended in the air, including dust, pollen, soot, smoke, and liquid droplets. The particles' size and composition may vary [1]. According to [2], PM has a variety of physical properties, including particle size and number, total surface area, and electrostatic properties, as well as biological and chemical components. PM is a composite of several chemical species rather than a single pollutant. A national study of metropolitan populations in the United States found that short-term exposure to PM components was associated with higher mortality [3]. A study of the health effects of PM characteristics found that the toxicity of PM varied with particle size, location, and season [4].
The yearly open burning of biomass in Sumatra, Indonesia has worsened the air quality in Malaysia, resulting in transboundary haze. In Indonesia, forest fires have increased due to El Niño, inappropriate forest management, and a lack of fire suppression [5]. PM10 concentrations in Asia and the Pacific continue to be the leading cause of air pollution issues [6], and PM10 is considered the most harmful pollutant in Peninsular Malaysia and Southeast Asia [7].
Prediction models are useful tools for predicting PM10 concentrations in Malaysia since they are meant to reduce autocorrelation or inaccuracy in the model. Statistical modelling can predict PM10 concentrations with high precision [8,9]. In statistical studies, meteorological data and air pollution monitoring data have been used as predictors in regression procedures. Compared with nonlinear models, linear models are simpler and easier to interpret. Multiple linear regression (MLR) is one of the modelling strategies that has been used to estimate PM10 concentrations [10][11][12]. This study was therefore conducted to forecast PM10 concentrations using an MLR and a Bayesian regression model (BRM). Its objectives are to carry out a descriptive analysis of PM10, meteorological, and gaseous parameters; to predict PM10 concentrations for the next day (Day 1) and the next two days (Day 2) using MLR and BRM; and to determine the best model for these predictions using performance indicators.

Description of study area
The four stations used to forecast PM10 concentration were Nilai, Shah Alam, Klang, and Jerantut. Located in central Peninsular Malaysia, these stations represent a background site (Jerantut), an industrial area (Nilai), and two urban areas (Shah Alam and Klang).

Data
Monitoring data were obtained from the Malaysian Department of Environment (DOE) for the period January 2008 to December 2012. The data were analysed using multiple linear regression and the Bayesian regression model. The dependent variable in this study is particulate matter smaller than 10 microns (PM10), and the independent variables are temperature, relative humidity, sulphur dioxide, nitrogen dioxide, carbon monoxide, and ozone. Table 1 provides a summary of the variables. The monitoring data were split into two groups: 80 percent were used for training and the remaining 20 percent for validation.
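The data preparation described above can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the function names are hypothetical, and the split is assumed to be chronological (first 80 percent for training), which the text does not state.

```python
def make_lagged_pairs(pm10, horizon):
    """Pair each day's PM10 value with the value `horizon` days later.

    horizon=1 gives (today, Day 1) pairs; horizon=2 gives (today, Day 2) pairs.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return list(zip(pm10[:-horizon], pm10[horizon:]))


def split_train_validation(rows, train_fraction=0.8):
    """Split rows into training and validation sets.

    Assumption: the split is chronological, keeping the first
    `train_fraction` of the rows for training.
    """
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]
```

In practice each row would also carry the other predictors (temperature, relative humidity, and the gaseous pollutants) alongside the current-day PM10 value.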

Statistical analysis
Three regression models are employed in this study: multiple linear regression (MLR), the Bayesian regression model with non-informative prior (BRM-NIP), and the Bayesian regression model with conjugate prior (BRM-CP).

Multiple linear regression (MLR)
There is a value of the dependent variable Y for each value of the independent variable X.
The general equation for a multiple linear regression with multiple independent variables and n observations is shown in Equation (1).
In a regression model, the β's are treated as constants while Y and X are treated as variables. Statistical software determines the best-fitting line for the observed data by least squares, that is, by minimising the sum of the squares of the vertical deviations from the data points to the line [13]. The parameters used in the model are temperature (T), relative humidity (RH), nitrogen dioxide (NO2), sulphur dioxide (SO2), carbon monoxide (CO), ozone (O3), and particulate matter at that time (PM10D0). However, in some cases, such as missing or limited data, an independent parameter is not included in the regression model.
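For reference, a multiple linear regression of this kind is conventionally written as below. The notation is standard textbook notation and is an assumption here (k predictors, error term ε, and the subscript D1 for the next-day concentration); it is not necessarily the notation of Equation (1).

```latex
% General MLR with k predictors and n observations
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i,
\qquad i = 1, \dots, n

% Least-squares estimate: minimise the sum of squared vertical deviations
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
  \Bigl( Y_i - \beta_0 - \sum_{j=1}^{k} \beta_j X_{ji} \Bigr)^{2}

% Day 1 model with the seven predictors listed above
\mathrm{PM}_{10,D1} = \beta_0 + \beta_1 T + \beta_2 \mathrm{RH} + \beta_3 \mathrm{NO_2}
  + \beta_4 \mathrm{SO_2} + \beta_5 \mathrm{CO} + \beta_6 \mathrm{O_3}
  + \beta_7 \mathrm{PM}_{10,D0} + \varepsilon
```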

Bayesian regression (BRM)
The posterior distribution P(θ|x) is proportional to the product of the likelihood function P(x|θ) and the prior distribution of the parameter P(θ) [12]. Bayes' rule is applied in Bayesian statistics as

$$P(\theta \mid x) \propto P(x \mid \theta)\, P(\theta) \tag{2}$$

$$\Pr(\text{posterior distribution}) \propto \Pr(\text{gamma}) \times \Pr(\text{normal or uniform}) \tag{3}$$
In Bayesian statistics, both the β's and X are considered random variables. Accordingly, the likelihood for x is taken to be normal and the prior distribution for β is uniform or normal [14]. In this study, Tau (τ) follows a gamma prior distribution and Beta (β) follows either a uniform or a normal prior distribution [15]. Priors are classified into two types: conjugate priors and non-informative priors. A prior is conjugate when the posterior distribution has the same form as the prior distribution; here the conjugate prior is the normal-gamma prior. A non-informative prior is used when there is little or no prior information [16]. Inference in this study was based on a random sample of 2000 observations [15]. Bayesian analysis often uses the Markov Chain Monte Carlo (MCMC) method. The techniques used to sample from the posterior probability distribution are Gibbs sampling and Metropolis-Hastings [17]. In this study, Gibbs sampling in the software WinBUGS, run from an R package, was used to obtain the posterior distribution of the parameters [13].
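To make the Gibbs sampling step concrete, the sketch below samples the posterior of a linear regression with a normal prior on β and a gamma prior on τ. It is a NumPy stand-in, not the WinBUGS model used in the study. Several things are assumptions: τ is treated as the precision of the error term, the prior on β is independent of τ, and the hyperparameter values and function name are purely illustrative.

```python
import numpy as np


def gibbs_linear_regression(X, y, n_draws=2000, burn_in=500, seed=0):
    """Gibbs sampler for y = X @ beta + e, with e ~ Normal(0, 1/tau).

    Assumed priors (illustrative values, not taken from the study):
      beta ~ Normal(0, 1e4 * I)
      tau  ~ Gamma(shape=0.01, rate=0.01)
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    prior_mean = np.zeros(p)
    prior_prec = np.eye(p) * 1e-4        # inverse of the prior covariance
    a0, d0 = 0.01, 0.01                  # gamma shape and rate
    XtX, Xty = X.T @ X, X.T @ y

    tau = 1.0
    beta_draws = np.empty((n_draws, p))
    tau_draws = np.empty(n_draws)
    for it in range(burn_in + n_draws):
        # beta | tau, y is multivariate normal
        cov = np.linalg.inv(prior_prec + tau * XtX)
        cov = (cov + cov.T) / 2.0        # guard against round-off asymmetry
        mean = cov @ (prior_prec @ prior_mean + tau * Xty)
        beta = rng.multivariate_normal(mean, cov)
        # tau | beta, y is gamma; NumPy takes a scale, so invert the rate
        resid = y - X @ beta
        rate = d0 + 0.5 * (resid @ resid)
        tau = rng.gamma(a0 + n / 2.0, 1.0 / rate)
        if it >= burn_in:
            beta_draws[it - burn_in] = beta
            tau_draws[it - burn_in] = tau
    return beta_draws, tau_draws
```

The column means of `beta_draws` give posterior mean estimates of the regression coefficients. The default of 2000 retained draws mirrors the sample size quoted above, on the assumption that it refers to posterior draws.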

Results and Discussion

Descriptive analysis for PM10
The results of the descriptive analysis are summarised in Table 3. PM10 concentrations are compared against the Malaysian Ambient Air Quality Guidelines (MAAQG) interim target-1 (2015), since the dataset covers 2008 to 2012. The Klang station has the highest mean PM10 concentration (266 µg/m³), followed in decreasing order by Shah Alam (212 µg/m³), Nilai (160 µg/m³), and Jerantut (104 µg/m³). For all locations, the distribution of PM10 concentration is skewed to the right, and all monitoring stations have positive kurtosis. The Klang station has the highest skewness (2.09), followed by Shah Alam (1.60) and Nilai (1.09). The skewness of 0.60 at the Jerantut station suggests a distribution close to normal. The coefficient of variation (CV) measures the dispersion of the data. According to the CV values, the Jerantut and Klang monitoring stations have the most variable PM10 concentrations (CV = 0.38), followed by Shah Alam (0.34) and Nilai (0.29). Jerantut and Klang therefore exhibit comparable relative variability in PM10 concentration, while the lower CV at Nilai shows that its PM10 concentrations were the least dispersed. The analysis reveals that high particulate events happened at three monitoring stations (Shah Alam, Klang, and Nilai) between 2008 and 2012. In each of the five years, the average concentration at the Shah Alam, Klang, and Nilai stations was above 50 µg/m³. The PM10 concentration at the Jerantut monitoring station is below interim target-1 of the Malaysian Ambient Air Quality Standard (MAAQS). This is comparable to the results reported in [7], where the concentrations of all pollutants at the Jerantut monitoring station are lower and do not exceed the threshold.
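The descriptive statistics discussed above can be computed as in the following sketch. The function name is hypothetical, and the exact estimators are assumptions: skewness and kurtosis here are the simple moment-based versions, with kurtosis reported as excess kurtosis. Statistical packages often apply bias corrections, so values may differ slightly from those in Table 3.

```python
import statistics


def describe(values):
    """Return mean, SD, CV, skewness and excess kurtosis for a sample."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)                     # sample standard deviation
    # Central moments (population form) used for the shape statistics
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "sd": sd,
        "cv": sd / mean,                              # coefficient of variation
        "skewness": m3 / m2 ** 1.5,                   # > 0 means right-skewed
        "kurtosis": m4 / m2 ** 2 - 3.0,               # excess kurtosis
    }
```

A CV of 0.38 compared with 0.29 therefore means the standard deviation is a larger fraction of the mean, that is, the concentrations are relatively more dispersed.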

Multiple linear regression model (MLR)
The standardised coefficients summarise the contribution of each variable to the model: the higher the standardised coefficient, the stronger the contribution of that variable to the PM10 concentration. Table 4 displays the best MLR models for PM10 for the following day and the following two days at the Jerantut, Shah Alam, Klang, and Nilai monitoring stations. R² values for the Jerantut monitoring station range from 0.519 to 0.803, which indicates that the regression models fit well. For the Day 1 prediction at Jerantut, the best model (2012, R² = 0.803) identifies temperature, particulate matter on that day (PM10D0), and ozone as the main factors, since they have the highest coefficient values. At the Shah Alam monitoring station, the best model (2012, R² = 0.688) shows that the key contributors are particulate matter on that day (PM10D0), NO2, and relative humidity. At the Klang monitoring station, temperature, relative humidity, and particulate matter on that day (PM10D0) make the highest contributions to forecasting the next day's PM10 concentration in 2012 (R² = 0.745). Nilai is close to a significant rail and air transportation network. There, the model with the greatest R² value (0.685), for 2011, identifies relative humidity, NO2, and PM10 on that day (PM10D0) as the key contributors for predicting the next day's PM10 concentration. The summary of the MLR prediction models is shown in Table 5.
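As an illustration of how standardised coefficients and R² relate to a fitted MLR, the sketch below fits the model by ordinary least squares with NumPy. The function name is hypothetical and this is not the software used in the study. Standardised coefficients are computed with the usual rescaling: each slope is multiplied by the ratio of the predictor's standard deviation to that of the response.

```python
import numpy as np


def fit_mlr(X, y):
    """Fit y = b0 + X @ b by least squares.

    Returns (coefficients including intercept, standardised slopes, R^2).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])    # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Rescale each slope so that predictors on different units are comparable
    std_coef = coef[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)
    return coef, std_coef, r2
```

Ranking the absolute values of `std_coef` gives the ordering of contributors described in the text, independent of the units of each predictor.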

Bayesian regression model (BRM)
The BRM was used in this study to illustrate particular assumptions about the model parameters. The prior distributions employed were the uniform, normal, and gamma distributions. Priors can be classified into two categories: non-informative priors (NIP) and conjugate priors (CP). The results of the BRM for the PM10 concentration at the Jerantut, Shah Alam, Klang, and Nilai monitoring stations are shown in Tables 6 and 7.

Conclusion
The purpose of the study was to carry out a descriptive analysis of PM10, meteorological parameters, and gaseous parameters, and to predict the PM10 concentrations for the next day and the next two days using MLR and BRM models. The descriptive analysis shows that the highest mean PM10 concentration occurred at the Klang station, followed by Shah Alam, Nilai, and Jerantut. The MLR provides a good model for predicting PM10 concentration at Jerantut, Nilai, and Klang for certain years. Performance indicators were applied to the MLR, BRM-CP, and BRM-NIP models to identify the best model. The results show that BRM-CP is the best model for predicting the next-day and next-two-day PM10 concentrations at all locations in the study.

Table 1. The summary of variables

Calculating the performance indicators allows the model's performance to be evaluated. The coefficient of determination (R²), index of agreement (IA), prediction accuracy (PA), normalised absolute error (NAE), and root mean square error (RMSE) were employed as performance indicators. The formulas for the performance indicators are given in Table 2.
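The five indicators can be computed as in the sketch below. The formulas are assumptions based on definitions commonly used in the air quality modelling literature; the exact forms in Table 2 may differ, for example in using n - 1 rather than n in the RMSE, or a correlation-based form of PA. The function name is hypothetical.

```python
import math


def performance_indicators(observed, predicted):
    """Compute NAE, RMSE, PA, IA and R2 for paired observations and predictions."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    sq_err = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    ss_obs = sum((o - o_mean) ** 2 for o in observed)
    ss_pred = sum((p - p_mean) ** 2 for p in predicted)
    cross = sum((p - p_mean) * (o - o_mean) for p, o in zip(predicted, observed))
    return {
        # Error measures: closer to 0 is better
        "NAE": sum(abs(p - o) for p, o in zip(predicted, observed)) / sum(observed),
        "RMSE": math.sqrt(sq_err / n),                # assumption: divide by n
        # Accuracy measures: closer to 1 is better
        "PA": sum((p - o_mean) ** 2 for p in predicted) / ss_obs,
        "IA": 1.0 - sq_err / sum(
            (abs(p - o_mean) + abs(o - o_mean)) ** 2
            for p, o in zip(predicted, observed)
        ),
        "R2": cross ** 2 / (ss_pred * ss_obs),        # squared Pearson correlation
    }
```

With these definitions a perfect prediction gives NAE = RMSE = 0 and PA = IA = R² = 1.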

Table 2. The performance indicators

Table 3. The descriptive analysis of PM10 concentration for 2008-2012 (SD: standard deviation; CV: coefficient of variation)

Table 4. The best MLR models for the next day and the next two days of PM10 at the Jerantut, Shah Alam, Klang and Nilai monitoring stations

Table 6. The best normal and gamma prior distributions for the next day and the next two days of PM10 at all monitoring stations

Table 7. The best uniform and normal prior distributions for the next day and the next two days of PM10 at all monitoring stations

The validity of the models is tested to determine their practical prediction accuracy. The evaluation of MLR and BRM is based on a statistical comparison of the model predictions with the actual PM10 concentrations. Tables 8, 9, 10, and 11 show the performance metrics used to compare the MLR and BRM models for PM10. The performance indicators used in this study to identify the best model are the coefficient of determination (R²), index of agreement (IA), prediction accuracy (PA), normalised absolute error (NAE), and root mean square error (RMSE).
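One simple way to turn several indicators into a single choice of model is to count, for each indicator, which model comes closest to the ideal value. The sketch below does this. The selection rule, the function names, and the ideal values (0 for error measures, 1 for accuracy measures) are assumptions for illustration, not the procedure documented in the study.

```python
# Ideal value of each indicator: errors should be near 0, accuracy near 1
IDEAL = {"NAE": 0.0, "RMSE": 0.0, "PA": 1.0, "IA": 1.0, "R2": 1.0}


def count_wins(scores):
    """Count, per model, the indicators on which it is closest to the ideal.

    `scores` maps a model name to a dict of indicator values.
    """
    wins = {model: 0 for model in scores}
    for name, ideal in IDEAL.items():
        best = min(scores, key=lambda m: abs(scores[m][name] - ideal))
        wins[best] += 1
    return wins


def best_model(scores):
    """Return the model that wins on the most indicators."""
    wins = count_wins(scores)
    return max(wins, key=wins.get)
```

For ties, `min` and `max` return the first model in insertion order, so a more careful comparison would inspect the individual indicator values as well.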

Table 9. Performance index for PM10 concentration prediction at Shah Alam monitoring station, 2008-2012