Improving the performance of streamflow forecasting model using data-preprocessing technique in Dungun River Basin

An accurate streamflow forecasting model is important for developing flood mitigation plans and ensuring the sustainable development of a river basin. This study adopted the Variational Mode Decomposition (VMD) data-preprocessing technique to process and denoise rainfall data before feeding them into a Support Vector Machine (SVM) streamflow forecasting model, in order to improve the performance of that model. Rainfall and river water level data for the period 1996-2016 were used for this purpose. Homogeneity tests (Standard Normal Homogeneity Test, Buishand Range Test, Pettitt Test and Von Neumann Ratio Test) and normality tests (Shapiro-Wilk Test, Anderson-Darling Test, Lilliefors Test and Jarque-Bera Test) were carried out on the rainfall series. The data at all stations were found to be homogenous and non-normally distributed. The recorded rainfall data showed that the Dungun River Basin receives higher monthly rainfall from November to February, during the Northeast Monsoon. The monthly and seasonal rainfall series of this monsoon are therefore the main focus of this research, as floods usually occur during the Northeast Monsoon period. The water levels predicted by the SVM model were compared with the observed water levels using non-parametric statistical tests (Biased Method, Kendall's Tau B Test and Spearman's Rho Test).


Introduction
The Dungun River Basin was chosen as the study area for this research. It is one of the flood-prone areas in Peninsular Malaysia and has attracted the attention of a number of researchers. Flooding can cause extensive damage to the livelihoods of the local population within the affected area. At the end of 2016, the National Disaster Management Agency (NADMA) of Malaysia reported that heavy rain had caused flooding in Terengganu state, forcing hundreds to evacuate. Dungun was one of the affected areas.
Previous research on the Dungun River Basin includes work on flood forecasting and early warning systems [1] and on hydrological extreme flood events [2]. For data preprocessing, VMD, which is more robust to sampling and noise, was recently introduced as an alternative to Empirical Mode Decomposition (EMD) [3]. For streamflow forecasting, Artificial Intelligence (AI) has been widely used in engineering and science problems since the middle of the 20th century. AI has performed well in forecasting and modelling non-linear hydrological applications. The SVM approach was selected as the streamflow forecasting model for this study.
As stated by Dragomiretskiy and Zosso [3], the multi-resolution VMD model was recently introduced as an alternative to the adaptive EMD technique in order to overcome the limitations of EMD. According to Aneesh et al. [4], VMD decomposes a signal into various modes, or Intrinsic Mode Functions (IMFs), using the calculus of variations. In contrast to EMD, VMD is able to extract the modes concurrently, properly balance errors between them and separate tones of similar frequencies [3]. VMD has been applied widely and has performed well in different fields of study, such as biomedical signal denoising [5][6] and the analysis of international stock markets [7]. For streamflow forecasting, SVM, which was developed by Vapnik [8], covers both classification and regression analysis; the regression form is known as Support Vector Regression (SVR). SVR has been improved over time, has been applied successfully to a range of prediction problems and has contributed widely to the hydrological field [9][10][11][12][13][14][15]. However, although SVR performs quite well and flexibly in hydrological time series forecasting, it has limitations when the original hydrological data are highly non-stationary and vary over a range of scales [16][17].
The aim of this study is to investigate the improvement of the selected streamflow forecasting model using a data-preprocessing technique for the Dungun River Basin, Terengganu. The specific objectives are to process the observed rainfall data with the data-preprocessing technique as input for the streamflow forecasting model; to calibrate and validate the streamflow forecasting model using both the processed and the observed rainfall data; to generate water levels from both the processed and the observed rainfall data; and to assess the performance of the streamflow forecasting model with and without the data-preprocessing technique using appropriate performance evaluation measures.

Workflow
Streamflow forecasting with the chosen time series model began with the collection of rainfall and water level data from the Department of Irrigation and Drainage (DID) Malaysia [18]. The observed rainfall data were screened using homogeneity tests to ensure homogenous input rainfall data. Normality tests were also conducted on the observed rainfall data to decide which statistical methods to employ. The selected data-preprocessing technique is VMD, and the chosen streamflow forecasting model is SVM. Before the SVM procedure, the observed rainfall data were processed using VMD. Finally, the water levels predicted by the SVM were compared with the observed water levels using suitable quantitative statistical performance evaluation measures.

Study area
The Dungun River Basin, located in Dungun, one of the seven districts of Terengganu state, Malaysia, was chosen as the study area. The location of the basin in Peninsular Malaysia and the distribution of the rainfall stations are shown in Fig. 1. All rainfall data, water level data and the rating curve were obtained from DID Malaysia for 6 rainfall stations within the Dungun River Basin [18]. However, only 3 rainfall stations within the catchment have a good continuous record. A common series length of 20 years (1996-2016) was determined for these 3 stations. The details of the rainfall stations are shown in Table 1. A mean rainfall series for the catchment (3 rainfall stations) was computed using the Thiessen Polygon method. The area weights for the 3 rainfall stations are shown in Fig. 2 and Table 2.
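The Thiessen Polygon averaging can be illustrated with a minimal sketch. The station names, rainfall depths and polygon areas below are hypothetical placeholders, not the values of Table 2, and the function name is introduced here only for illustration.

```python
def thiessen_mean(station_rainfall, station_area):
    """Area-weighted mean rainfall over a catchment.

    station_rainfall: dict of station id -> rainfall depth (e.g. mm)
    station_area:     dict of station id -> Thiessen polygon area (e.g. km^2)
    """
    total_area = sum(station_area.values())
    if total_area <= 0:
        raise ValueError("total polygon area must be positive")
    # Each station contributes in proportion to its polygon area.
    return sum(
        station_rainfall[s] * station_area[s] / total_area
        for s in station_rainfall
    )


# Hypothetical values for three stations (assumed, not from the paper).
rain_mm = {"A": 300.0, "B": 250.0, "C": 400.0}
area_km2 = {"A": 500.0, "B": 300.0, "C": 200.0}
mean_mm = thiessen_mean(rain_mm, area_km2)
```

Applying the same weights to every time step of the three station series would give one mean rainfall series for the catchment.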

Homogeneity test
Homogeneity tests were applied to the rainfall data of every rainfall station before further investigation, in order to obtain more homogenous input data. Four homogeneity tests were applied to the rainfall data, namely the Standard Normal Homogeneity Test (SNHT), the Buishand Range test (BR), the Pettitt test (PET) and the Von Neumann Ratio test (VNR).
In frequentist statistical hypothesis testing, the data are tested against the null hypothesis that they are homogenous. If the p-value exceeds the significance level α = 0.05, the data are considered homogenous; if the p-value is below 0.05, the data are considered non-homogenous. The criteria used in this study are similar to those proposed by Wijngaard et al. [19] for determining the homogeneity of data series. If at least three out of four tests reject homogeneity, the time series is categorized as "suspect". If two out of four tests reject homogeneity, the time series is categorized as "doubtful". The time series is categorized as "useful" when only one or zero tests reject homogeneity.
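The three-way classification can be sketched as a small counting rule. The p-values in the example are hypothetical and the function name is an assumption made for illustration; only the thresholds (0.05, and the counts of rejecting tests) come from the text.

```python
def classify_homogeneity(p_values, alpha=0.05):
    """Classify a series from the p-values of the four homogeneity tests.

    p_values: dict of test name -> p-value (SNHT, BR, PET, VNR)
    A test rejects homogeneity when its p-value is below alpha.
    """
    rejections = sum(1 for p in p_values.values() if p < alpha)
    if rejections >= 3:
        return "suspect"
    if rejections == 2:
        return "doubtful"
    return "useful"  # zero or one test rejects homogeneity


# Hypothetical p-values for one month at one station (assumed).
example = {"SNHT": 0.21, "BR": 0.03, "PET": 0.40, "VNR": 0.12}
label = classify_homogeneity(example)
```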

Normality test
If the data are normally distributed, parametric statistical tests are used to evaluate them. If the data are non-normally distributed, non-parametric statistical tests are preferred. Several tests are used to assess the normality of the input data set: the Shapiro-Wilk Test, the Anderson-Darling Test, the Lilliefors Test and the Jarque-Bera Test. In each normality test, the data are tested against the null hypothesis that they are normally distributed. If the p-value exceeds the significance level α = 0.05, the data are considered normal; if the p-value is below 0.05, the data are considered not normal. The input data are treated as normally distributed only if they pass all four normality tests.
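As an illustration of one of these tests and of the "pass all tests" rule, the sketch below computes the Jarque-Bera statistic from the sample skewness and kurtosis and uses the asymptotic chi-square distribution with two degrees of freedom for the p-value. This is a simplified stand-in, assuming the asymptotic approximation is acceptable; it is not necessarily the software or the exact procedure used in the study, and the sample values are hypothetical.

```python
import math


def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for a sample."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("constant series has no defined skewness")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square variable with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


def passes_all(p_values, alpha=0.05):
    """Data count as normal only if no test rejects normality."""
    return all(p > alpha for p in p_values)


# Hypothetical, strongly right-skewed monthly rainfall sample (assumed).
sample = [0.0] * 30 + [50.0]
jb_stat, jb_p = jarque_bera(sample)
```

The p-values of the other three tests would be collected in the same way and passed together to `passes_all`.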

Variational Mode Decomposition (VMD) for data-preprocessing
The VMD method proposed by Dragomiretskiy and Zosso [3] decomposes the input signal into a discrete number K of sub-signals (modes), where each mode has a limited bandwidth in the spectral domain. Each mode k is therefore required to be mostly compact around a centre pulsation, which is determined along with the decomposition. VMD finds the central frequencies and the IMFs centred on those frequencies concurrently, using an optimization method called the Alternating Direction Method of Multipliers (ADMM). The original formulation of the optimization problem is continuous in the time domain.
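The constrained variational problem given by Dragomiretskiy and Zosso [3] is:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{subject to} \quad \sum_{k=1}^{K} u_k(t) = f(t)$$

where $f$ is the signal, $u_k$ is the kth mode, $\omega_k$ is its centre frequency, $\delta$ is the Dirac distribution, $t$ is the time, $K$ is the number of modes, $j$ is the imaginary unit and $*$ denotes convolution. The high-order modes represent the low-frequency components.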
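For orientation, the ADMM iteration that solves this constrained problem can be written in the frequency domain, following the notation of the original VMD paper [3]. The symbols introduced here are not defined in the text: a hat denotes the Fourier transform, $\alpha$ is the quadratic penalty (bandwidth) parameter of VMD and is unrelated to the significance level used in the statistical tests, $\lambda$ is the Lagrange multiplier, $\tau$ is its update step and $n$ is the iteration index. In the sum over $i \neq k$, the most recent estimate of each mode is used.

```latex
\hat{u}_k^{\,n+1}(\omega) =
  \frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}^{n}(\omega)/2}
       {1 + 2\alpha\,(\omega - \omega_k^{n})^{2}}

\omega_k^{\,n+1} =
  \frac{\int_{0}^{\infty} \omega\,\lvert \hat{u}_k^{\,n+1}(\omega) \rvert^{2}\, d\omega}
       {\int_{0}^{\infty} \lvert \hat{u}_k^{\,n+1}(\omega) \rvert^{2}\, d\omega}

\hat{\lambda}^{\,n+1}(\omega) =
  \hat{\lambda}^{n}(\omega) + \tau \Bigl( \hat{f}(\omega) - \sum_{k=1}^{K} \hat{u}_k^{\,n+1}(\omega) \Bigr)
```

The first update acts as a Wiener-type filter centred on the current frequency estimate, the second moves each centre frequency to the centre of gravity of its mode's power spectrum, and the third enforces the reconstruction constraint.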

Support Vector Machine (SVM) for streamflow forecasting
SVM, proposed by Vapnik [8], was developed for classification and later extended to regression. SVR is the form of SVM used to solve regression problems. The regression function that relates the input vector $x$ to the output $\hat{y}$ is formulated as:

$$\hat{y} = f(x) = w \cdot \phi(x) + b$$

where $w$ is a weight vector, $b$ is a bias and $\phi$ is a nonlinear transfer function that maps the input vectors into a high-dimensional feature space, in which, theoretically, a simple linear regression can match the complex nonlinear regression of the input space.
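The final regression function can be rewritten in the standard SVR form as:

$$f(x) = \sum_{k=1}^{N} (\alpha_k - \alpha_k^{*})\, K(x, x_k) + b$$

where $x_k$ is the kth support vector, $N$ is the number of support vectors, $\alpha_k$ and $\alpha_k^{*}$ are the Lagrange multipliers and $K(x, x_k) = \phi(x) \cdot \phi(x_k)$ is the kernel function.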
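To make the support-vector form of the regression function concrete, the sketch below evaluates a kernel expansion over a set of support vectors. The radial basis function kernel, the parameter `gamma`, the coefficient values and all function names are assumptions for illustration; the text does not state which kernel or parameters were used, and training (finding the coefficients) is not shown.

```python
import math


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel between two input vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Evaluate the bias plus the sum of coef_k * K(x, x_k).

    dual_coefs[k] plays the role of the combined multiplier
    attached to the kth support vector.
    """
    return bias + sum(
        coef * rbf_kernel(x, sv, gamma)
        for coef, sv in zip(dual_coefs, support_vectors)
    )


# Hypothetical fitted model with two support vectors (assumed values).
support_vectors = [[0.0], [1.0]]
dual_coefs = [1.0, -1.0]
prediction = svr_predict([0.0], support_vectors, dual_coefs, bias=0.5, gamma=1.0)
```

Only the support vectors enter the sum, which is why the prediction cost depends on their number rather than on the size of the training set.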

Parametric statistical tests
Parametric statistical tests require the input data to be normally distributed. They are used to evaluate the accuracy of the predicted water levels against the observed water levels. In this study, three standard performance measures were chosen for this purpose, namely Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE) and Coefficient of Efficiency (CE).
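The three measures can be sketched as follows. The formulas are the commonly used definitions, with CE taken to be the Nash-Sutcliffe form; that reading of CE is an assumption, since the text does not give the equations. The water level values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root Mean Square Error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mape(observed, predicted):
    """Mean Absolute Percentage Error in percent (observed must be non-zero)."""
    n = len(observed)
    return 100.0 / n * sum(abs((o - p) / o) for o, p in zip(observed, predicted))


def coefficient_of_efficiency(observed, predicted):
    """Nash-Sutcliffe style CE: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed series is constant")
    return 1.0 - residual / variance


# Hypothetical observed and predicted water levels in metres (assumed).
observed_level = [2.0, 4.0]
predicted_level = [1.0, 5.0]
scores = (
    rmse(observed_level, predicted_level),
    mape(observed_level, predicted_level),
    coefficient_of_efficiency(observed_level, predicted_level),
)
```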

Non-parametric statistical tests
If the input data are non-normally distributed, non-parametric statistical tests need to be used. As with the parametric statistical tests, they evaluate the accuracy of the predicted water levels against the observed water levels. In this study, the models' performance is evaluated using three standard performance measures, namely the Biased Method, Kendall's Tau B Test and Spearman's Rho Test.
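A minimal sketch of the three measures is given below. The bias is assumed here to be the mean of predicted minus observed values, because the text does not define the Biased Method; Kendall's Tau B and Spearman's Rho follow their usual tie-aware definitions. All function names and sample values are illustrative.

```python
import math


def mean_bias(observed, predicted):
    """Assumed definition: mean of (predicted - observed)."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


def kendall_tau_b(x, y):
    """Kendall's Tau B with the standard correction for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both series: drops out of the formula
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denominator == 0:
        raise ValueError("at least one series is constant")
    return (concordant - discordant) / denominator


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("at least one series is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observed and predicted water levels (assumed).
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 3.0, 2.0, 4.0]
tau, rho, bias = kendall_tau_b(obs, pred), spearman_rho(obs, pred), mean_bias(obs, pred)
```

Both correlation measures depend only on the ordering of the values, which is why they remain applicable when the data are not normally distributed.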

Homogeneity test
The homogeneity tests were applied to the three rainfall stations to obtain more homogenous input data. A p-value above the significance level α = 0.05 indicates that the data are homogenous. For every month, at most one test rejected homogeneity. Hence, the observed rainfall data series are categorized as "useful". The results of the homogeneity tests for the three rainfall stations are shown in Table 3, Table 4 and Table 5. A value highlighted in orange indicates that the respective homogeneity test rejected homogeneity for that month.

Normality test
The normality tests (Shapiro-Wilk Test, Anderson-Darling Test, Lilliefors Test and Jarque-Bera Test) were applied to the observed rainfall data. The p-value was less than 0.05 for all normality tests. Hence, the input data are non-normally distributed. The results of the normality tests for the three rainfall stations are shown in Table 6.

Application of data-preprocessing technique
VMD was applied as the data-preprocessing technique to the monthly and seasonal rainfall series of the Northeast Monsoon (November to February) only. The rainfall data were separated into an 18-year training set and a 2-year validation set. For the period 1996-2016, VMD was first used to process the first 18 years of rainfall data, which act as the training set, and then the last 2 years of rainfall data, which act as the validation set.
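The selection of monsoon months and the chronological split can be sketched as below. The record layout `(year, month, rainfall)`, the function names and the boundary year are assumptions for illustration; the text does not state the exact year at which the validation period begins, so it is left as a parameter.

```python
# Northeast Monsoon months named in the text: November to February.
MONSOON_MONTHS = {11, 12, 1, 2}


def monsoon_only(records):
    """Keep only (year, month, rainfall) records in the monsoon months."""
    return [r for r in records if r[1] in MONSOON_MONTHS]


def split_train_validation(records, first_validation_year):
    """Chronological split: earlier years train, later years validate."""
    train = [r for r in records if r[0] < first_validation_year]
    validation = [r for r in records if r[0] >= first_validation_year]
    return train, validation


# Hypothetical monthly records (assumed values, not observed data).
records = [
    (2013, 11, 410.0),
    (2013, 6, 120.0),
    (2014, 12, 655.0),
    (2015, 1, 380.0),
]
train_set, validation_set = split_train_validation(
    monsoon_only(records), first_validation_year=2015
)
```

Decomposing the two sets separately, as described in the text, keeps information from the validation period out of the training inputs.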

Application of streamflow forecasting model
SVR was used to predict water levels from the processed and the observed rainfall data. At this stage, the model was trained on the first 18 years and validated on the last 2 years of the processed and observed rainfall data. The water levels predicted from both the processed and the observed rainfall data were compared with the observed water levels using statistical measures. The hydrographs of observed and predicted water levels, using processed and observed rainfall data, for the validation set (2 years of rainfall data) are shown in Fig. 3.

Model performance evaluation
As shown by the normality tests, the input data are non-normally distributed, so non-parametric statistical tests were used to evaluate the models' performance. The goodness-of-fit statistics used to evaluate the models' performance are shown in Table 7, for both the training and validation sets of water levels predicted from the processed and the observed rainfall data. The accuracy of the predicted water levels improved when the rainfall data were processed with VMD. In the validation period, the water levels predicted from the processed rainfall data gave a bias, Kendall's Tau B correlation coefficient and Spearman's Rho correlation coefficient of -133.38, 0.319 and 0.442, respectively, compared with -134.44, 0.292 and 0.405, respectively, for the water levels predicted from the observed rainfall data.

Conclusions
VMD has been shown to be a robust data-preprocessing technique for denoising rainfall data before they are fed into the streamflow forecasting model. The accuracy of the streamflow forecasting model was analysed using non-parametric statistical tests and was shown to improve when the model was coupled with VMD. Overall, the application of VMD as a data-preprocessing technique is able to improve the performance of the streamflow forecasting model.