Water quality index prediction with hybridized ELM and Gaussian process regression

The Department of Environment (DOE) of Malaysia evaluates river water quality based on the water quality index (WQI), a single-number function of six parameters: ammonia nitrogen (AN), biochemical oxygen demand (BOD), chemical oxygen demand (COD), dissolved oxygen (DO), pH, and suspended solids (SS). The conventional WQI calculation is tedious and requires all six parameter values to compute the final WQI. In this study, the extreme learning machine (ELM) and the radial basis function kernel Gaussian process regression (GPR) were enhanced with bootstrap aggregating (bagging) and adaptive boosting (AdaBoost) for WQI prediction at the Klang River, Malaysia. The global performance indicator (GPI) was used to evaluate the models’ performance. By preparing different input combinations for the WQI prediction, the parameter importance was found in the following order: DO > COD > SS > AN > BOD > pH, and all models demonstrated lower prediction accuracy with fewer parameter inputs. The GPR showed a consistent trend with higher WQI prediction accuracy than the ELM. The AdaBoost-ELM worked better than the bagged-ELM for all input combinations, while the bagging algorithm improved the GPR prediction under certain scenarios. The bagged-GPR reported the highest GPI of 1.86 for WQI prediction using all six parameter inputs.


Introduction
Poor surface water quality has a critical impact on human health as well as on ecosystems comprising flora and fauna. Considering that most daily water supply is sourced from rivers, a better river water quality management system is strongly demanded by policymakers. If poor river water quality is predicted, preventive measures can be taken immediately. In Malaysia, water quality (WQ) related studies have mainly focused on urban rivers located in major cities, where the main pollutant is municipal sewage [1]. In terms of environmental and ecological problems, the number of WQ parameters is quite extensive [2]; hence, robust mathematical techniques are required to combine the physicochemical characterization of water into a single variable that describes the WQ. The water quality index (WQI) is a single number that uses a set of physicochemical water parameters to express the quality of water at a certain place and time. The water quality of rivers in Malaysia is ranked into five classes by the Department of Environment (DOE) based on the WQI [3], which is a function of six WQ parameters: ammonia nitrogen (AN), biochemical oxygen demand (BOD), chemical oxygen demand (COD), dissolved oxygen (DO), pH, and suspended solids (SS). The conventional method suggested by the DOE requires lengthy transformations to estimate the sub-indices. Besides, each sub-index involves a different set of equations, so calculating the final WQI takes considerable effort and time. Any missing data would make the WQI calculation impossible. Hence, machine learning methods such as artificial neural networks (ANNs) have been used as alternatives for WQI estimation based on raw data [2].
ANNs are a class of nonlinear models frequently adopted for classification, clustering and time series forecasting. By discovering the inherent knowledge present within the data, the neural network (NN) recognizes the relationship between input and output without explicit physical considerations. The common types of ANNs include the multi-layer perceptron NN (MLPNN), radial basis function NN (RBFNN), general regression NN (GRNN), back-propagation NN (BPNN) and extreme learning machine (ELM) [4]. Hore et al. [5] proposed an MLPNN with back-propagation comprising a single hidden layer with 30 neurons for river WQI prediction. Gazzaz and co-workers [6] presented a three-layer MLPNN to model the Kinta River WQI using multiple variables and reported that DO was the most important parameter, followed by BOD, AN, pH and COD. Similarly, Hameed et al. [7] found that the RBFNN outperformed the BPNN in modelling the Langat River WQI and that DO was the parameter most strongly correlated with the predicted WQI, followed by BOD, COD, SS, AN and pH. The authors also challenged their models to predict the WQI without the BOD input and found that the RBFNN again outperformed the BPNN. They explained that the Gaussian RBF was able to mimic phenomena such as the chaotic disturbances and nonlinear dynamics of the WQ parameters that contributed to the fluctuation of the WQI.
The ELM is a simple learning algorithm for single-hidden-layer feedforward neural networks in which the hidden nodes are selected randomly and the output weights of the network are determined analytically. It has been reported that the ELM achieves better generalization performance and higher learning speed than conventional ANNs [8]. Some researchers [9] employed the ELM with sigmoid activation function, radial basis activation function, online sequential ELM and optimally pruned ELM to predict DO in the Cumberland Basin of the USA based on water temperature, specific conductance, turbidity, and pH. The results showed that the optimally pruned ELM achieved the highest prediction accuracy and outperformed the MLPNN. Yi and co-workers [10] adopted the ELM to predict chlorophyll-α in the Nakdong River of Korea based on WQ and meteorological data, and reported that the ELM achieved better fitting and lower prediction errors than the BPNN, adaptive neuro-fuzzy inference system (ANFIS) and linear regression (LR). To the best of our knowledge, the ELM has yet to be applied for WQI prediction in the Klang River Basin. In this study, the performance of the ELM in predicting the WQI with fewer parameter inputs was compared to that of the RBF-based Gaussian process regression (GPR) model. We aim to develop machine learning (ML) models that are capable of predicting the WQI with fewer parameter inputs, which has practical application for WQ modelling when data are missing due to sensor faults or human error. Two data fusion techniques, namely bagging and AdaBoost, were hybridized with the base ELM and GPR models to improve their prediction performance.

Data
The data used in this study were obtained from the DOE of Malaysia and consist of WQ data collected bimonthly at station 1K08 along the Klang River. The WQ parameters, namely DO, BOD, COD, pH, AN, and SS, were collected over a period of 20 years and amounted to 343 data points for this study. Located in the Klang River catchment basin, the Klang River and its tributaries were mostly polluted with ammonia and often categorized as not healthy [11]. The location of the selected station is illustrated in

WQI calculation
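The formula employed by the DOE for calculating the WQI is presented as Eq. (1),
$$\mathrm{WQI} = 0.22\,SI_{DO} + 0.19\,SI_{BOD} + 0.16\,SI_{COD} + 0.16\,SI_{SS} + 0.15\,SI_{AN} + 0.12\,SI_{pH} \quad (1)$$
where $SI_{DO}$, $SI_{BOD}$, $SI_{COD}$, $SI_{SS}$, $SI_{AN}$ and $SI_{pH}$ represent the sub-indices of DO in saturation percentage, BOD, COD, SS, AN and pH, respectively. The best-fit equations for estimating the sub-indices are shown in Table 1. The index score ranges from 0 to 100, where a higher WQI indicates a higher quality of river water.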
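As a minimal sketch of the final aggregation step, the snippet below combines six sub-index values into a WQI using the commonly published DOE weights (0.22, 0.19, 0.16, 0.15, 0.16 and 0.12 for DO, BOD, COD, AN, SS and pH). The sub-index values are assumed to have been computed already from the best-fit equations in Table 1, which are not reproduced here; the function name and dictionary keys are illustrative only.

```python
# Illustrative sketch: aggregate DOE sub-indices into a WQI.
# The weights are the commonly published DOE-WQI weights (an assumption here,
# since the paper's own equation is not reproduced in this code).
DOE_WEIGHTS = {
    "DO": 0.22, "BOD": 0.19, "COD": 0.16, "AN": 0.15, "SS": 0.16, "pH": 0.12,
}

def wqi_from_subindices(sub_indices):
    """Return the weighted sum of the six sub-indices (each on a 0-100 scale)."""
    missing = set(DOE_WEIGHTS) - set(sub_indices)
    if missing:
        # The conventional formula cannot be evaluated with a missing parameter.
        raise ValueError(f"missing sub-indices: {sorted(missing)}")
    return sum(DOE_WEIGHTS[name] * sub_indices[name] for name in DOE_WEIGHTS)

# Hypothetical sub-index values, not measurements from station 1K08.
example = {"DO": 80.0, "BOD": 70.0, "COD": 60.0, "AN": 50.0, "SS": 90.0, "pH": 95.0}
print(round(wqi_from_subindices(example), 2))
```

The explicit error on a missing key mirrors the limitation noted in the Introduction: the conventional calculation cannot proceed unless all six parameters are available.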

Extreme Learning Machine (ELM)
The topological structure of the ELM is illustrated in Fig. 2.

Fig. 2. Architecture of ELM
The ELM randomly selects the input weights ($w_i$) and biases ($b_i$) for the hidden nodes and analytically determines the output weights ($\beta_i$) so that the least-squares solution can be obtained [12]. For a dataset containing $N$ samples, the output function ($f(x)$) of the network with $L$ hidden nodes activated by the sigmoid function, $g(\cdot)$, can be computed as Eq. (2), Eq.
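The training procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions not stated in the paper: input weights and biases are drawn uniformly from [-1, 1], the least-squares output weights are obtained with the Moore-Penrose pseudo-inverse, and synthetic data stand in for the Klang River records. The function names `elm_fit` and `elm_predict` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, y, n_hidden=20, seed=0):
    """Fit a single-hidden-layer ELM; returns (W, b, beta)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # random input weights (never trained)
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random hidden biases (never trained)
    H = sigmoid(X @ W + b)                                   # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ y                             # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Toy usage on synthetic data (not the Klang River dataset).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2]
W, b, beta = elm_fit(X[:150], y[:150], n_hidden=30)
pred = elm_predict(X[150:], W, b, beta)
rmse = float(np.sqrt(np.mean((pred - y[150:]) ** 2)))
print(f"test RMSE: {rmse:.4f}")
```

Only `beta` is learned from the data, which is why training reduces to a single linear solve rather than iterative back-propagation.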

Gaussian Process Regression (GPR)
Unlike conventional ML models that learn exact values for each parameter in a function, the GPR is a nonparametric Bayesian approach to regression that produces a distribution of predicted values, which allows it to simulate arbitrarily complex systems. The predictive distribution is assumed to be Gaussian, so a point prediction can be obtained from its mean and the uncertainty can be quantified from its variance. By using different types of kernel functions, the GPR allows the user to add parameters and prior information about the shape of the model. The RBF kernel function, $k(x_i, x_j)$, was adopted in this study as shown in Eq. (5),
$$k(x_i, x_j) = \exp\left(-\frac{d(x_i, x_j)^2}{2l^2}\right) \quad (5)$$
where $l$ represents the length scale of the kernel and $d(\cdot,\cdot)$ is the Euclidean distance [13]. The hyperparameters of the kernel function were optimized by minimizing the negative log marginal likelihood (NLML), as detailed in [13].
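To make the mean and variance outputs concrete, the following NumPy sketch evaluates a zero-mean GP posterior with an RBF kernel and computes the NLML for a fixed length scale. It is an illustration only: the length scale and the small noise term are fixed by hand rather than optimized, the data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """k(a, b) = exp(-d(a, b)^2 / (2 * length_scale^2)), d = Euclidean distance."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * length_scale ** 2))

def gpr_predict(X_train, y_train, X_test, length_scale=1.0, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP with an RBF kernel."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v ** 2, axis=0)  # prior variance k(x, x) is 1 for this kernel
    return mean, np.maximum(var, 0.0)

def neg_log_marginal_likelihood(X, y, length_scale=1.0, noise=1e-6):
    """0.5 * y^T K^-1 y + 0.5 * log|K| + (n / 2) * log(2 * pi)."""
    K = rbf_kernel(X, X, length_scale) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2.0 * np.pi)

# Toy usage on synthetic data: one test point on a training input, one far away.
X_train = np.arange(8.0).reshape(-1, 1)
y_train = np.sin(X_train).ravel()
X_test = np.array([[3.0], [100.0]])
mean, var = gpr_predict(X_train, y_train, X_test)
print(mean, var, neg_log_marginal_likelihood(X_train, y_train))
```

Near a training input the variance collapses towards zero, whereas far from the data the prediction falls back to the prior (mean 0, variance 1), which is the uncertainty quantification the GPR provides in addition to a point prediction.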

Bagging
Bagging is a data-based resampling approach aimed at reducing prediction variance without affecting prediction bias. The original dataset is resampled with replacement to generate $B$ bootstrap datasets, and $B$ base models are generated, one trained on each bootstrap dataset. The bootstrap estimate is then computed from these models as Eq. (6) [14].
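The resample-then-average idea can be sketched with a deliberately simple base learner. In the example below, a one-dimensional least-squares line stands in for the ELM or GPR, the ensemble prediction is taken as the plain average of the individual predictions (the usual choice for bagged regression, assumed here), and the data are synthetic. All function names are hypothetical.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predictor function."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0  # degenerate resample: fall back to a constant
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def bagged_fit(xs, ys, n_estimators=25, seed=0):
    """Train one base model per bootstrap dataset."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n indices with replacement
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_predict(models, x):
    """Average the predictions of all bootstrap models."""
    return mean(m(x) for m in models)

# Toy usage: noisy samples of y = 3 + 2x (synthetic, not WQ data).
noise_rng = random.Random(0)
xs = [float(i) for i in range(20)]
ys = [3.0 + 2.0 * x + noise_rng.gauss(0.0, 1.0) for x in xs]
models = bagged_fit(xs, ys)
pred = bagged_predict(models, 10.0)
print(round(pred, 3))
```

Because each bootstrap dataset omits some observations and repeats others, the individual fits differ slightly, and averaging them damps that variability.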

Adaptive boosting
As an iterative approach, the adaptive boosting (AdaBoost) algorithm learns from the mistakes of weak learners and turns them into stronger ones by assigning a weight to each sample in the training data. Training data points that are assigned a higher weight have a larger say in the subsequent learning round. The AdaBoost algorithm, which improves stability and convergence speed, is described in the literature [15].
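The text does not spell out which AdaBoost variant was used beyond citing [15], so the following is only a sketch of one common regression variant, AdaBoost.R2 with a linear loss, and should not be read as the authors' exact procedure. It shows a single reweighting round: samples that the current learner predicts poorly end up with larger weights for the next round. The function name and the toy numbers are hypothetical.

```python
def adaboost_r2_round(weights, y_true, y_pred):
    """One AdaBoost.R2 reweighting round with the linear loss.

    Returns (new_weights, beta); a smaller beta means a more accurate learner.
    """
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    max_err = max(abs_err)
    if max_err == 0.0:
        return list(weights), 0.0            # perfect learner: nothing to reweight
    loss = [e / max_err for e in abs_err]    # per-sample loss scaled to [0, 1]
    total = sum(weights)
    probs = [w / total for w in weights]
    avg_loss = sum(p * l for p, l in zip(probs, loss))
    if avg_loss >= 0.5:
        raise ValueError("average loss >= 0.5: AdaBoost.R2 stops boosting here")
    beta = avg_loss / (1.0 - avg_loss)
    # Well-predicted samples (low loss) are shrunk most; the worst sample keeps its weight.
    new_w = [w * beta ** (1.0 - l) for w, l in zip(weights, loss)]
    norm = sum(new_w)
    return [w / norm for w in new_w], beta

# Toy usage: the fourth sample has the largest error and gains the most weight.
weights = [0.25, 0.25, 0.25, 0.25]
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [10.0, 21.0, 30.0, 44.0]
new_weights, beta = adaboost_r2_round(weights, y_true, y_pred)
print([round(w, 3) for w in new_weights], round(beta, 3))
```

In the full AdaBoost.R2 algorithm, each learner's vote in the final ensemble is log(1/beta) and the ensemble output is a weighted median; those steps are omitted here for brevity.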

Performance indicators
The predicted WQI was compared to the actual WQI, and the root mean square error (RMSE), mean absolute error (MAE), mean bias error (MBE), mean absolute percentage error (MAPE) and coefficient of determination (R²) were calculated based on Eqs. (7)-(11), where $n$ indicates the number of observations, and $y_i$ and $\hat{y}_i$ represent the actual and predicted WQI, respectively. To ease the comparison and ranking of the models, the statistical performance metrics were combined into a global performance indicator (GPI) in Eq. (12),
$$\mathrm{GPI}_i = \sum_{j=1}^{5} \alpha_j \left(\tilde{y}_j - y_{ij}\right) \quad (12)$$
where $\tilde{y}_j$ indicates the median of the scaled values of performance metric $j$ and $y_{ij}$ represents the scaled value of performance metric $j$ for model $i$ [16]. The coefficient $\alpha_j$ is set to 1 for all performance metrics except for R², where $\alpha_j = -1$ was assigned. All indicator values were scaled from 0 to 1 to prevent any single indicator from having a dominant effect. A model with a larger positive GPI value performs better than the other models.
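The metric and GPI calculations can be sketched in plain Python as follows. Several details are assumptions rather than statements from the paper: the MBE is taken as predicted minus actual, R² is computed as 1 - SS_res/SS_tot, and the 0-to-1 scaling is a min-max scaling across the models being compared. The function names and the three toy "models" are hypothetical.

```python
from math import sqrt
from statistics import mean, median

def regression_metrics(actual, predicted):
    n = len(actual)
    err = [p - a for a, p in zip(actual, predicted)]
    a_bar = mean(actual)
    ss_res = sum(e ** 2 for e in err)
    ss_tot = sum((a - a_bar) ** 2 for a in actual)
    return {
        "RMSE": sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in err) / n,
        "MBE": sum(err) / n,                 # assumed sign: predicted - actual
        "MAPE": 100.0 * sum(abs(e / a) for e, a in zip(err, actual)) / n,
        "R2": 1.0 - ss_res / ss_tot,         # assumed definition of R²
    }

def gpi_scores(metrics_by_model):
    """GPI_i = sum_j alpha_j * (median_j - scaled_ij), alpha = -1 for R2, else 1."""
    models = list(metrics_by_model)
    names = list(metrics_by_model[models[0]])
    scaled = {m: {} for m in models}
    for name in names:
        values = [metrics_by_model[m][name] for m in models]
        lo, hi = min(values), max(values)
        for m in models:
            # Min-max scaling to [0, 1]; a constant metric contributes nothing.
            scaled[m][name] = (metrics_by_model[m][name] - lo) / (hi - lo) if hi > lo else 0.0
    gpi = {}
    for m in models:
        total = 0.0
        for name in names:
            med = median(scaled[k][name] for k in models)
            alpha = -1.0 if name == "R2" else 1.0
            total += alpha * (med - scaled[m][name])
        gpi[m] = total
    return gpi

# Toy usage with three hypothetical sets of predictions.
actual = [50.0, 60.0, 70.0, 80.0]
predictions = {
    "A": [50.0, 60.0, 70.0, 80.0],
    "B": [52.0, 58.0, 73.0, 78.0],
    "C": [60.0, 50.0, 85.0, 70.0],
}
metrics = {name: regression_metrics(actual, p) for name, p in predictions.items()}
gpi = gpi_scores(metrics)
print({name: round(value, 3) for name, value in gpi.items()})
```

A model whose metrics all sit at the median scores a GPI of zero, so positive values mark models that are better than the typical model in the comparison set and negative values mark worse ones.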

Selection of input parameters
The water quality data were fed into the base ELM and GPR for WQI prediction, and the predicted values were compared to the actual WQI calculated using the empirical formula. The water quality parameter with the least impact on the computed GPI was removed from the model, and the eliminated parameter was not considered in successive combinations to ensure a consistent trend in the parameter elimination process. The pH was the first water quality parameter to be removed from the computation of the WQI, followed by the BOD, AN, SS and COD. It is worth mentioning that both the theoretical and practical results support DO as the most influential parameter in determining the WQI, but a conflict in the BOD ranking was observed. Although a heavier weightage is allocated to the BOD in the empirical model, the removal of the BOD had minimal impact on the GPI. This could be because both the BOD and COD represent the amount of oxygen needed to decompose organic pollutants in water bodies; the COD covers certain features of the BOD, so the removal of the latter did not significantly deteriorate the model. For the same reason, the COD becomes the second most influential parameter after DO in the prediction model. On the other hand, DO, COD and SS showed a consistent order of influence on the WQI estimation in both the theoretical calculation and the prediction. These three parameters are the major water quality parameters and contribute considerably to the WQI prediction in this study.
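The elimination procedure described above amounts to a greedy backward selection, which can be sketched as follows. The scoring callable is a hypothetical stand-in for "train the model on this subset and compute its GPI", and the importance numbers in the toy example are invented purely so the sketch runs; they are not values from the paper's tables.

```python
def backward_elimination(parameters, score_fn):
    """Repeatedly drop the parameter whose removal hurts the score least.

    score_fn(subset) -> float is a hypothetical callable returning a
    higher-is-better score (e.g. a GPI) for a model trained on `subset`.
    Returns the parameters in removal order, least influential first;
    the last element is the parameter that was never removed.
    """
    remaining = list(parameters)
    removed = []
    while len(remaining) > 1:
        # Try dropping each remaining parameter and keep the best-scoring subset.
        best_drop = max(
            remaining,
            key=lambda p: score_fn([q for q in remaining if q != p]),
        )
        remaining.remove(best_drop)  # an eliminated parameter is never reconsidered
        removed.append(best_drop)
    return removed + remaining

# Toy usage: an invented additive score in place of real model training.
toy_importance = {"DO": 6.0, "COD": 5.0, "SS": 4.0, "AN": 3.0, "BOD": 2.0, "pH": 1.0}

def toy_score(subset):
    return sum(toy_importance[p] for p in subset)

order = backward_elimination(["AN", "BOD", "COD", "DO", "pH", "SS"], toy_score)
print(order)
```

Each pass corresponds to moving from one input combination to the next smaller one (for example from six inputs to five), so the successive subsets play the role of CM6 down to CM1.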

Performance of base models
Tables 2 and 3 present the prediction performance of the ELM and GPR, respectively. Both the ELM and GPR models reported a high GPI in CM6, when all water quality parameters were provided for the WQI prediction. For the ELM, the GPI for CM4 dropped sharply due to the spike in the RMSE, MBE and MAPE values along with a lower R². In contrast, the GPR shows a consistent trend of predicting the WQI with higher accuracy when more input parameters are provided. In the case of CM1, where only the DO is provided for the WQI prediction, the GPR outperformed the ELM with R² > 0.6, which shows that the former could approximate the trend of the WQI with a certain degree of accuracy. The GPR model that received five parameters (CM5) achieved the highest GPI in the WQI prediction with R² = 0.97, thus confirming the reliability of the model in WQI prediction without the pH input.
Tables 4 and 5 show the prediction performance metrics for the bagged-GPR and bagged-ELM models, respectively. The bagged-GPR demonstrated a consistently increasing GPI with more parameter inputs. For the bagged-ELM, the trend is unusual in that CM5 demonstrated a low GPI, probably due to the spike in the RMSE along with a lower R². Besides, the bagged-GPR demonstrated a higher GPI than the bagged-ELM for all prediction scenarios and achieved the highest GPI of 1.86 for CM6, where all six water quality parameters were provided to the model for prediction.
The prediction performances of the AdaBoost-ELM and AdaBoost-GPR are tabulated in Tables 6 and 7, respectively. As can be seen from Tables 6 and 7, both the AdaBoost-ELM and AdaBoost-GPR show an obvious improvement in the GPI with more water quality parameter inputs. For CM6, which represents the full parameter input scenario, the AdaBoost-GPR demonstrated a slightly higher GPI of 1.814 than the 1.789 reported by the AdaBoost-ELM. Besides, a consistent trend in the R² values could be observed for the AdaBoost-ELM across the input scenarios, particularly in CM6, CM5, CM4 and CM1, where an R² greater than 0.9 indicates a good fit between the predicted and actual WQI.
Table 8 compares the performance of bagging and AdaBoost in improving the WQI prediction of the ELM and GPR under the different input combination formats.
The base ELM model showed inconsistent performance in predicting the WQI. For instance, the ELM model that received four parameter inputs (CM4) has a lower GPI than those that received three parameters (CM3) and two parameters (CM2). Introducing data fusion techniques such as bagging and AdaBoost into the ELM model improved model consistency, giving a trend in which prediction accuracy increases as more input parameters are provided. The AdaBoost-ELM outperformed the bagged-ELM and standalone ELM models for almost all input combinations, except for CM1. It is worth mentioning that all the ELM-based models show a negative GPI with CM1, which indicates that WQI prediction based solely on DO may not be reliable. A similar result was also reported for all GPR-based models. In contrast, all the ELM-based models performed very well in CM6, where all parameters were provided, similar to the GPR models.

Comparison of base and hybrid models
An obvious trend can be observed in which increasing the number of parameter inputs improves the GPR performance. The highest GPI was recorded in CM6 with all six parameter inputs, where bagging and AdaBoost significantly improved the performance of the base GPR model. Unlike for the ELM model, the AdaBoost algorithm was unable to outperform both bagging and the conventional GPR in any input combination. As a result, the bagged-GPR exhibited the best prediction performance in CM6, CM3 and CM2, whereas the base GPR demonstrated the highest GPI for CM5, CM4 and CM1. In the scenario where all six parameters were provided to the model, both the bagging and AdaBoost algorithms further improved the WQI prediction performance of the base models, and the GPR-based hybrid models outperformed the ELM-based hybrid models by demonstrating higher GPI values.

Conclusions
In this study, water quality data in different input combination formats (CM1-6) were fed into the standalone and data fusion-hybridized (bagging and boosting) ELM and GPR for WQI prediction. Most models demonstrated their best performance when all six WQ parameters were provided as input, except for the GPR, where CM5 demonstrated a higher GPI than CM6. This study reveals that DO is the most influential water quality parameter while pH is the least impactful parameter in the WQI prediction. However, the prediction of the WQI using DO as the sole input parameter was not reliable, as demonstrated by high prediction errors. Complementary water quality parameters such as the COD, SS and AN are necessary for good prediction. Most of the models require three or more parameters to achieve a reasonable WQI prediction with a MAPE of less than 5 %. For the base model comparison, the GPR shows better performance in predicting the WQI when the number of inputs is ≥ 4. Compared with bagging, AdaBoost worked better on the ELM by attaining higher prediction accuracy regardless of the input combination. In contrast, the bagged-GPR demonstrated a higher GPI than the AdaBoost-GPR in most combinations. Although the base GPR model outperformed both hybrid models in CM5, CM4 and CM1, the AdaBoost-GPR and bagged-GPR exhibited a higher GPI for CM6, thus confirming the effectiveness of data fusion algorithms in enhancing model prediction. Given that model performance varies with the level of pollutants, similar modelling work can be repeated at different stations to explore the generalization capability of the models. Besides, future studies may focus on adopting feature selection tools to reduce the input dimension, using different data fusion or hybridization techniques, and adopting deep learning models such as the gated recurrent unit in WQI prediction.