Prediction of Particulate Matter (PM10) during High Particulate Event in Peninsular Malaysia using Novel Hybrid Model

High Particulate Events (HPE) contribute to the deterioration of air quality, as the fine particles present can be inhaled, leading to respiratory diseases and other health problems. Given the adverse effects of air pollution episodes on human health, it is crucial to create suitable models that can effectively and accurately predict air pollutant concentrations. This study proposed a hybrid model for forecasting the next-day PM10 concentration at four stations in Peninsular Malaysia, namely Shah Alam, Nilai, Bukit Rambai and Larkin. Hourly air pollutant concentrations (PM10, NOx, NO2, SO2, CO, O3) and meteorological parameters (relative humidity RH, temperature T, wind speed WS) during the HPE in 1997, 2005, 2013 and 2015 were used. Support Vector Machine (SVM) weighting and Quantile Regression (QR) were combined to construct hybrid models (SVM-QR) with a reduced number of input variables. Performance indicators such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Index of Agreement (d2) were used to evaluate the performance of the predictive models. The SVM-QR model performed well in all areas. SVM-3 was selected as the best model at Bukit Rambai (MAE=5.72, RMSE=9.71) and Shah Alam (MAE=11.89,


Introduction
Extremely high concentrations of particulate matter (PM) in ambient air, known as high particulate events (HPE), usually occur during haze episodes. These severe episodes occur when the PM concentration far exceeds the Malaysian Ambient Air Quality Standard (MAAQS) for PM10, which was 150 µg/m3 for the 24-hour average [1].
Harrison et al. [2] reported that peatland fire is a significant contributor to high particulate events in Southeast Asia. The increasing severity of air pollution can be attributed to wildfire pollution originating from neighbouring countries, coupled with the influence of the southwest monsoon, whose dry and hot weather typically intensifies biomass burning [3,4].
The escalation of air pollution has resulted in significant health consequences, including higher mortality rates and respiratory and cardiovascular diseases. Numerous studies have demonstrated the impact of air pollution on the respiratory and circulatory systems [5][6][7]. Since air pollution has been a problem for decades, many studies have been conducted to predict air pollutant concentrations.
One of the most common statistical approaches is linear regression, which is well known for its simplicity and reliability, especially when dealing with linear relationships. However, most studies focus on the overall mean of PM10 levels, which is not appropriate during extreme conditions. Linear regression may not provide accurate predictions in complex situations such as non-linear data and data with extreme values [6].
Considering the devastating effects of air pollutants, it is crucial to create suitable models to predict air pollution levels in order to determine future concentrations or to locate pollutant sources. Therefore, this study proposed a hybrid model for forecasting PM10 concentration in selected areas in Peninsular Malaysia during HPE. A feature selection process was proposed to reduce the input variables before developing the hybrid models.

Data Acquisition
Four stations located in Peninsular Malaysia, namely Larkin (Johor), Bukit Rambai (Melaka), Nilai (Negeri Sembilan) and Shah Alam (Selangor), were chosen as the study locations. Table 1 describes the background of each monitoring station. Larkin, Bukit Rambai and Nilai are located in the southern region of Peninsular Malaysia, while Shah Alam is located near the centre of the peninsula and is also known as an urban-industrial area, with residential and commercial areas surrounded by busy motorways [8]. All of these stations are prone to transboundary smoke from the Sumatera region because they lie on the west coast of Peninsular Malaysia.

Data Pre-processing
First, the missing observations of all air pollutant parameters were filled in before the analysis. The missing data were treated using the Linear Interpolation (LI) method in IBM SPSS Software Version 26. It is important to fill in missing data before any analysis because the success of the modelling depends on the quality of the dataset [9,10].
A random selection of 80% of the data was used to develop the model, and the remaining 20% of the data was used to evaluate the model's accuracy.
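The two pre-processing steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the SPSS procedure used in the study: the function names `linear_interpolate` and `train_test_split`, the sample readings and the random seed are all hypothetical.

```python
import random

def linear_interpolate(series):
    """Fill None gaps by linear interpolation between the nearest known neighbours.

    Leading and trailing gaps are left as None, because there is no second
    neighbour to interpolate towards.
    """
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Randomly split rows into a training part and a held-out evaluation part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical hourly PM10 readings (ug/m3) with two missing observations.
pm10 = [80.0, None, None, 110.0, 120.0]
filled = linear_interpolate(pm10)      # gaps become 90.0 and 100.0

# 100 hypothetical record indices split 80/20.
train, test = train_test_split(range(100))
```

Fixing the seed makes the random 80/20 split reproducible, which matters when models built on different feature subsets are later compared on the same held-out data.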

Feature Selection by using SVM weighting
Feature selection is the process of reducing the number of input variables when developing a predictive model [11]. It is used in data pre-processing to achieve efficient data reduction [12]. In this study, a filter-based feature selection method, Support Vector Machine (SVM) weighting, was selected. SVMs are a set of supervised learning methods used for classification, regression and outlier detection [13]. The selection was conducted using the SVM-based operator in RapidMiner Studio version 9.10. This operator uses the coefficients of the normal vector of a linear SVM as attribute weights, thereby calculating the relevance of each attribute of the input dataset with respect to the class attribute [14]. This weight-by-SVM operation works as a filter process and ranks the air pollutant parameters before they are used for modelling.
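The idea of using the normal vector of a linear SVM as attribute weights can be illustrated with a small, self-contained sketch. It is not the RapidMiner operator: it assumes a linear epsilon-insensitive support vector regression fitted by plain batch subgradient descent, standardized inputs, and synthetic data in which the target depends strongly on one feature, weakly on a second and not at all on a third. All function names, hyperparameters and data are hypothetical.

```python
import random

def standardize(columns):
    """Scale each feature column to zero mean and unit variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        std = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5
        out.append([(v - mean) / std for v in col])
    return out

def linear_svr_weights(columns, y, epsilon=0.1, lam=0.01, lr=0.05, iters=500):
    """Fit a linear epsilon-insensitive SVR by batch subgradient descent; return w."""
    n, p = len(y), len(columns)
    w, b = [0.0] * p, 0.0
    for _ in range(iters):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for i in range(n):
            resid = y[i] - b - sum(w[j] * columns[j][i] for j in range(p))
            if abs(resid) > epsilon:  # only points outside the epsilon tube contribute
                sign = 1.0 if resid > 0 else -1.0
                for j in range(p):
                    grad_w[j] -= sign * columns[j][i] / n
                grad_b -= sign / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w

def rank_by_weight(names, w):
    """Scale |w| so the largest weight is 1 and sort from most to least relevant."""
    top = max(abs(v) for v in w)
    return sorted(((name, abs(v) / top) for name, v in zip(names, w)),
                  key=lambda item: item[1], reverse=True)

# Synthetic data: the target depends strongly on "CO", weakly on "O3", not on "WS".
rng = random.Random(0)
names = ["CO", "O3", "WS"]
raw = [[rng.uniform(0, 1) for _ in range(200)] for _ in names]
target = [3.0 * co + 1.0 * o3 + rng.gauss(0, 0.05) for co, o3, _ in zip(*raw)]
ranking = rank_by_weight(names, linear_svr_weights(standardize(raw), target))
```

Standardizing the inputs first is a deliberate choice in this sketch: without it, the size of each coefficient would reflect the measurement unit of the variable as well as its relevance, and the ranking would not be comparable across pollutants and weather parameters.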

Quantile Regression
Quantile Regression (QR) generates a set of coefficients and equations at all quantiles, which makes it possible to examine the entire distribution of the variable of interest rather than a single measure of the central tendency of its distribution [15]. Therefore, the QR method is able to provide a more holistic picture of the effects of the predictors at various points of the PM10 distribution. Given a random variable $Y$ with right-continuous distribution function $F_Y(y) = P(Y \le y)$, the quantile function $Q_Y(\tau)$ with $\tau \in (0,1)$ is defined as follows [16]:

$$Q_Y(\tau) = F_Y^{-1}(\tau) = \inf\{\, y : F_Y(y) \ge \tau \,\}$$

The quantile can also be formulated as the solution to a minimization problem:

$$\min_{\beta_0(\tau),\, \hat{\beta}(\tau)} \; \sum_{i=1}^{n} \rho_\tau\!\left( y_i - \beta_0(\tau) - x_i^{\top} \hat{\beta}(\tau) \right), \qquad \rho_\tau(u) = u \left( \tau - \mathbb{1}\{u < 0\} \right)$$

where $n$ is the number of observations; $\tau$ is the specified percentile value (0.1, 0.2, 0.3, ..., 0.9); $y_i$ is the dependent variable (predicted PM10 level); $x_i$ are the explanatory variables (air pollutants and weather parameters); $\beta_0(\tau)$ is the y-intercept (constant term), which depends on $\tau$; $\hat{\beta}(\tau)$ are the slope coefficients for each explanatory variable, which depend on $\tau$; and $\rho_\tau$ is the check loss function.

This study used SPSS version 26 to develop models at nine percentiles, specifically the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th and 90th percentiles. The performance across the percentiles was studied to understand how well the models fit the data at each specific quantile. Based on the model performance evaluation, the percentile that showed the minimum error and the highest accuracy was chosen as the best percentile for developing the hybrid model. A hybrid model was then developed by combining the filter-based feature selection method, Support Vector Machine (SVM) weighting, with Quantile Regression (QR).
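The statement that a quantile is the solution of a minimization problem can be checked numerically in the simplest case, a model with an intercept only. The sketch below is an illustration under that assumption, not the SPSS routine used in the study; the function names and the PM10 values are hypothetical.

```python
def pinball_loss(residual, tau):
    """Check (pinball) loss: tau * u for u >= 0 and (tau - 1) * u for u < 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def total_loss(values, q, tau):
    """Sum of pinball losses when the constant q is used as the tau-quantile."""
    return sum(pinball_loss(v - q, tau) for v in values)

def best_constant(values, tau):
    """Return the observed value that minimises the total pinball loss."""
    return min(sorted(values), key=lambda q: total_loss(values, q, tau))

# Hypothetical daily PM10 values (ug/m3), including one extreme haze day.
pm10 = [40, 55, 60, 62, 70, 75, 90, 120, 150, 200, 400]
median_like = best_constant(pm10, 0.5)   # the sample median, 75
upper = best_constant(pm10, 0.9)         # the 10th of 11 ordered values, 200
mean = sum(pm10) / len(pm10)             # pulled upward by the 400 ug/m3 day
```

With these numbers the 0.5 solution is the sample median, while the 0.9 solution sits in the upper tail, and neither is dragged towards the single extreme value in the way the mean is. Full quantile regression replaces the constant with a linear function of the predictors and minimizes the same loss.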

Performance Indicator
To evaluate the performance of the regression models, several performance indicators, namely Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Index of Agreement (d2) and Prediction Accuracy (R2), were used to describe the goodness of fit of the PM10 prediction models. The formulae of the performance indicators are shown in Table 2 [10].

Result and Discussion

Optimizing the Number of Predictors (SVM Weight)
The predictors were ranked according to their SVM attribute weights; the higher the attribute weight, the more significant the variable for developing the hybrid models. Figure 1(a) to 1(d) illustrate the SVM weights for all study areas. CO was the most significant parameter at all monitoring stations. CO, which is mainly released by motor vehicles and machinery that use diesel fuel, appears to have the strongest correlation with PM10 concentration. Since all locations in this study are classified as industrial and/or urban areas, this may contribute to the result. Furthermore, the seasonal fires in Indonesia may also be a main contributor, since the data were collected during haze [17]. The SVM weighting order for Bukit Rambai starts to differ from the second rank, where CO was followed by O3 > NOx > T > RH > SO2 > NO2 > WS. Larkin and Nilai have the same ranking for the first three parameters, namely CO, SO2 and NO2, and differ thereafter. For Larkin, the parameters ranked from the 4th place onward were NOx > RH > WS > T > O3, whereas for Nilai they were O3 > WS > RH > NOx > T. Shah Alam showed the following descending order of SVM weights: CO > NO

Quantile Selection
To select the best quantile to be combined with the SVM-selected features, the performance of the QR method was evaluated. Table 3 presents the summary of the best percentile chosen for each monitoring station in predicting the next-day PM10 concentration. The selected percentiles for all stations fell within the range of 0.5 to 0.6: the best prediction performance at Bukit Rambai, Nilai and Larkin was obtained at 0.6, whereas Shah Alam performed best at 0.5. The percentile that exhibited the lowest error was taken as the ideal percentile for hybrid modelling. Choosing percentiles in the intermediate range of 0.5 to 0.6 allows the model to balance capturing the central tendency of air quality conditions, which represents typical or average values, against accommodating a certain degree of extreme or exceptional occurrences [18].

Overall, the hybrid model performed well across all stations. Table 5 shows the performance measures of the hybrid model for next-day PM10 concentration at all study areas. Bukit Rambai had the best performance with three selected parameters (SVM-3), with MAE=5.72, RMSE=9.71 and a third indicator value of 0.99. As at Bukit Rambai, the SVM-3 model also showed the best performance at Shah Alam. Larkin and Nilai recorded SVM-1 as the best model (with only one input parameter), with (MAE=7.22, RMSE=13.38, R2=0.85, IA=0.95) and (MAE=6.88, RMSE=11.84, R2=0.95, IA=0.98), respectively. The performance of the hybrid model was compared with two traditional regression methods, QR and Multiple Linear Regression (MLR). The QR model exhibited less error than MLR at all monitoring stations except Shah Alam. However, QR did not surpass the performance of the hybrid model. Overall, these results imply that the SVM-QR model was the most accurate model for predicting PM10 concentration during HPE: across all stations, its performance indicators showed less error and greater accuracy than the MLR and QR methods. The SVM feature selection approach was an excellent method for optimizing the number of selected parameters in predicting the PM10 concentration.

Conclusion
In conclusion, SVM-QR is an excellent alternative method for predicting PM10 concentration. The model saves training time by reducing the number of input features, and it helps prevent the model from learning noise, known as overfitting, which improves accuracy. The proposed model can accurately predict maximum daily air pollution episodes for the next day; it can be used as an early warning tool that gives air quality information to local authorities for formulating air quality improvement strategies.

Table 1. Specific location and background of the monitoring stations

Table 2. The formulae of the performance indicators
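The four indicators named in the performance indicator section can be computed as in the following sketch. The definitions used here are assumptions, since the formulae of the table are not reproduced in this text: the Index of Agreement follows Willmott's squared form, and R2 is taken as the squared Pearson correlation between observed and predicted values. The function names and sample values are hypothetical.

```python
import math

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def index_of_agreement(obs, pred):
    """Willmott's index of agreement (squared form); 1 means perfect agreement."""
    o_bar = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - o_bar) + abs(o - o_bar)) ** 2 for o, p in zip(obs, pred))
    return 1.0 - num / den

def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    o_bar, p_bar = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - o_bar) * (p - p_bar) for o, p in zip(obs, pred))
    var_o = sum((o - o_bar) ** 2 for o in obs)
    var_p = sum((p - p_bar) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)

# Hypothetical observed and predicted next-day PM10 values (ug/m3).
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]
scores = {
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "IA": index_of_agreement(observed, predicted),
    "R2": r_squared(observed, predicted),
}
```

MAE and RMSE are error measures in the units of the data, so lower is better, whereas IA and R2 are accuracy measures bounded by 1, so values closer to 1 are better.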

Table 3. Summary of best selected percentile

Table 4 shows the summary of the best prediction model for the next-day PM10 concentration according to model number and optimum input parameters. The input parameters selected by the SVM-QR model vary depending on the location. The SVM-1 model, with CO as its single input parameter, was the best model for Larkin and Nilai. For Bukit Rambai and Shah Alam, the best next-day prediction model was SVM-3, with three selected input parameters: CO, O3 and NOx for Bukit Rambai, and CO, NO2 and NOx for Shah Alam. It can also be concluded that CO has the strongest correlation with PM10 concentration, especially during HPE.
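The model numbering can be read as a set of nested feature subsets: SVM-k uses the k highest-ranked predictors, and the subset with the lowest error is kept. The sketch below assumes that reading, which matches the reported SVM-1 and SVM-3 inputs but is not stated explicitly in the text. The ranking is the one reported for Bukit Rambai; the function names and the error values in `toy_mae` are hypothetical and are not results from the study.

```python
def nested_models(ranking):
    """Map 'SVM-k' to the k highest-ranked predictors, for k = 1..len(ranking)."""
    return {f"SVM-{k}": ranking[:k] for k in range(1, len(ranking) + 1)}

def select_best(models, error_of):
    """Pick the model name whose feature subset gives the lowest error."""
    return min(models, key=lambda name: error_of(models[name]))

# Ranking reported for Bukit Rambai, from most to least significant.
bukit_rambai = ["CO", "O3", "NOx", "T", "RH", "SO2", "NO2", "WS"]
models = nested_models(bukit_rambai)

# Hypothetical validation errors per subset size, invented for illustration only.
toy_mae = {1: 8.0, 2: 6.5, 3: 5.0, 4: 5.5, 5: 6.0, 6: 6.2, 7: 6.4, 8: 6.6}
best = select_best(models, lambda feats: toy_mae[len(feats)])
```

In a real run, `error_of` would fit a quantile regression on the given subset at the selected percentile and return its error on the held-out 20% of the data.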

Table 4. The summary of the best prediction model according to model number and input parameter

Table 5. Performance measure of hybrid model for next day PM10 concentration at all study areas
Larkin and Nilai recorded SVM-1 as the best model (with only 1 optimum input parameter) with the values (MAE=7.22, RMSE=13.38, R2=0.85, IA=0.95) and (MAE=6.88, RMSE=11.84, R2