Development of Stochastic Mathematical Models for the Prediction of Heavy Metal Content in Surface Waters Using Artificial Neural Network and Multiple Linear Regression

The principal purpose of this study is to build stochastic neural models for predicting heavy metal contents in the surface waters of the Oued Inaouen catchment area (Taza region) from their physico-chemical parameters. To this end, we carried out a comparative study of two methods: multiple linear regression (MLR) and the artificial neural network (ANN) approach. The performance of the stochastic models developed by ANN and MLR was evaluated with the following statistical indicators: the sum of squared errors (SSE) and the coefficient of determination (R²), complemented by the study of fit graphs. The results show that predictive modelling using artificial neural networks is very effective, with coefficients of determination close to 0.99 against 0.17 to 0.40 for MLR. This performance indicates a non-linear relation between the studied physico-chemical characteristics and the heavy metal contents in the surface waters of the Oued Inaouen catchment area.


Introduction:
The sustainable management of natural resources and the protection of the environment are addressed in the environmental charter of Morocco [1], which serves as an introduction to the development and management of natural resources. Water pollution can result from wastewater contamination and from various industrial and domestic discharges, among which metals occupy a prominent place, as they are widely used in many fields.
All these discharges, in particular discharges of toxic micro-pollutants, i.e., heavy metals, cause water pollution with all the risks that this entails in terms of hygiene, biological life and environmental protection.
Indeed, water transports these heavy metals and inserts them into the food chain (algae, fish, etc.). These metals are most often present in trace amounts, but their toxicity develops through bioaccumulation in organisms. As a result, they are becoming an increasingly worrying problem [2].
The toxicity of heavy metals has led public authorities to regulate emissions by setting limit levels. Aware of this problem, the Moroccan public authorities and the national scientific community are increasingly interested in environmental studies, with a view to assessing risks and protecting our ecosystem [3].
Corresponding author: rachid.elchaal@uit.ac.ma
The present study proposes to develop mathematical models for predicting heavy metal contents (zinc, copper and manganese) from the physico-chemical parameters of the environment. The investigation concerns the surface waters of the Oued Inaouen catchment area.
To achieve this modelling objective, a comparative study was conducted between two prediction methods, namely multiple linear regression and artificial neural networks, using the Statistica 10 software.

Database:
Our database consists of 100 samples (observations) of surface water taken in the province of Taza from 2014 to 2015. The collection, transport and conservation of the water samples followed the protocol and procedures defined by the National Drinking Water Office. Part of the analyses was performed there, and the other part at the Regional University Interface Center (CURI) laboratory of the University Sidi Mohamed Ben Abdellah (USMBA) of Fez.
The database is divided as follows: 70% of the samples, selected randomly from the entire database, are used for the training phase of the model that forecasts the dependent variable.
The remaining 30% of the samples are used to monitor network performance during training, so as to avoid over-learning, and to test the predictive validity and effectiveness of the models (15% for the test and 15% for the validation).
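As an illustration, the 70/15/15 partition described above can be sketched in Python as follows. This is only a minimal sketch: the study itself relied on the Statistica software, and the function name `split_indices` and the fixed random seed are assumptions introduced here for reproducibility.

```python
import random

def split_indices(n_samples, train_frac=0.70, test_frac=0.15, seed=0):
    """Shuffle sample indices and cut them into train/test/validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random selection, reproducible via the seed
    n_train = round(n_samples * train_frac)
    n_test = round(n_samples * test_frac)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]       # whatever remains (15% here)
    return train, test, validation

# 100 observations, as in the database described above
train_idx, test_idx, val_idx = split_indices(100)
print(len(train_idx), len(test_idx), len(val_idx))  # 70 15 15
```

The three index lists are disjoint and together cover every sample, so no observation is used in more than one phase.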

Data Formatting [4]-[6]
Normalization is a data preprocessing method that helps reduce the complexity of models. The input data (16 independent variables) are raw, untransformed values with very different orders of magnitude. To bring the measuring scales into line, the data are converted into standardized variables. The values of each independent variable $i$ were normalized with respect to its mean and standard deviation according to the relation:

$$X_{s,i} = \frac{X_i - \bar{X}_i}{\sigma_i}$$

With: $X_{s,i}$: standardized value; $X_i$: original value; $\bar{X}_i$ and $\sigma_i$: mean and standard deviation of variable $i$.

The purpose of normalizing the values of all variables is to avoid very large or very small values in the exponential calculations and to limit the increase in variance with the mean. The values of the dependent variables were also normalized to the range [0; 1] to meet the requirements of the transfer function used by the neural networks. This normalization was performed using the following relationship:

$$Y_n = \frac{Y - Y_{\min}}{Y_{\max} - Y_{\min}}$$

With: $Y_n$: standardized value; $Y$: original value; $Y_{\min}$ and $Y_{\max}$: minimum and maximum observed values of the dependent variable.
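The two normalizations described in this section can be sketched as follows. This is a hedged illustration, not the procedure implemented in Statistica: the sample values are hypothetical, the use of the population standard deviation is an assumption (the sample standard deviation is an equally plausible choice), and the mapping to [0; 1] is assumed to be the usual min-max scaling.

```python
import statistics

def standardize(values):
    """Center on the mean and scale by the standard deviation (z-score)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # assumption: population standard deviation
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Map values linearly onto the range [0, 1] (assumed min-max scaling)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Hypothetical measurements, for illustration only
conductivity = [412.0, 530.0, 655.0, 780.0, 1120.0]   # one independent variable
copper = [0.02, 0.05, 0.03, 0.11, 0.08]               # one dependent variable

z = standardize(conductivity)
scaled = min_max_scale(copper)
print([round(v, 3) for v in z])
print([round(v, 3) for v in scaled])
```

After standardization each input variable has zero mean and unit standard deviation, while the scaled output lies between 0 and 1, which suits a bounded transfer function.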

Data modelling techniques
The modelling approaches developed to establish the best relationship between the Manganese (Mn), Zinc (Zn) and Copper (Cu) contents and the significant explanatory variables are based on two methods [7]: multiple linear regression (MLR) and artificial neural networks of the multilayer perceptron (MLP) type.
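To make the MLR method and the two performance indicators concrete, the sketch below fits a multiple linear regression by ordinary least squares and computes the SSE and R². It is an illustrative sketch under stated assumptions: the function names are hypothetical, the data are synthetic and exactly linear, and R² is taken as 1 - SSE/SST, the usual definition, which the study does not spell out.

```python
def fit_mlr(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(r) for r in X]            # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [v / pv for v in A[col]]
        for k in range(p):
            if k != col:
                f = A[k][col]
                A[k] = [a - f * b for a, b in zip(A[k], A[col])]
    return [A[i][p] for i in range(p)]             # [intercept, b1, b2, ...]

def predict_mlr(coefs, X):
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in X]

def sse(y_obs, y_pred):
    """Sum of squared errors between observed and predicted values."""
    return sum((o - q) ** 2 for o, q in zip(y_obs, y_pred))

def r_squared(y_obs, y_pred):
    """Coefficient of determination, taken here as 1 - SSE/SST."""
    mean = sum(y_obs) / len(y_obs)
    sst = sum((o - mean) ** 2 for o in y_obs)
    return 1.0 - sse(y_obs, y_pred) / sst

# Synthetic, exactly linear data: y = 1 + 2*x1 - 3*x2
X = [[0, 1], [1, 0], [2, 3], [3, 1], [4, 5]]
y = [1 + 2 * a - 3 * b for a, b in X]
coefs = fit_mlr(X, y)
y_hat = predict_mlr(coefs, X)
print([round(c, 6) for c in coefs], round(r_squared(y, y_hat), 6))
```

On exactly linear data the fit recovers the generating coefficients and R² equals 1; a low R² from MLR on real data, as reported later in this paper, suggests that a linear form does not capture the relationship.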

Results and Discussion
The results reported in Table 1 were obtained with the STATISTICA neural network tool. They show the determination coefficients, the number of iterations and the activation functions of the two layers for each topology; the training algorithm used is Broyden-Fletcher-Goldfarb-Shanno (BFGS). We varied the architecture of the network by changing the number of hidden neurons and/or the number of training cycles (number of iterations). To do this, we successively set the number of hidden neurons to 1, 2, 3, ..., 15. The results showed that the minimum mean square error and the maximum determination coefficient, together with the corresponding number of iterations, are reached when the number of hidden neurons equals 8 for Cu, 11 for Mn and 15 for Zn, which gives the best predictive model of heavy metal concentrations for each metal. Figure 1 describes the training of the network in the case of copper. After 102 steps, the required result is attained. With eight neurons in the hidden layer, the two curves (learning error and test error) converge correctly. The network was trained up to the point of overlearning; this phenomenon was encountered at 102 iterations.
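The search over the number of hidden neurons, combined with stopping at the onset of overlearning, can be sketched as below. This is a simplified illustration and not the Statistica procedure: plain stochastic gradient descent stands in for BFGS, the tanh hidden layer and logistic output are assumed activation functions, the data are synthetic with two inputs instead of 16, and all function names and hyperparameters (`patience`, `lr`, `max_epochs`) are hypothetical.

```python
import math
import random

def init_net(n_in, n_hidden, rng):
    """Small random weights for an [n_in - n_hidden - 1] perceptron (last entry = bias)."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w1, w2

def forward(net, x):
    w1, w2 = net
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in w1]
    z = sum(w * h for w, h in zip(w2, hidden)) + w2[-1]
    return hidden, 1.0 / (1.0 + math.exp(-z))      # logistic output in (0, 1)

def sse(net, X, y):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in zip(X, y))

def train_epoch(net, X, y, lr):
    """One pass of stochastic gradient descent on the squared error."""
    w1, w2 = net
    for x, t in zip(X, y):
        hidden, out = forward(net, x)
        d_out = (out - t) * out * (1.0 - out)
        d_hid = [d_out * w2[j] * (1.0 - h * h) for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            w2[j] -= lr * d_out * h
            for i, v in enumerate(x):
                w1[j][i] -= lr * d_hid[j] * v
            w1[j][-1] -= lr * d_hid[j]
        w2[-1] -= lr * d_out

def train_early_stopping(n_hidden, train, valid, max_epochs=50, patience=10, lr=0.1, seed=0):
    """Stop when the validation SSE has not improved for `patience` epochs.
    Returns the best validation SSE and the epoch at which it occurred."""
    net = init_net(len(train[0][0]), n_hidden, random.Random(seed))
    best_err, best_epoch, wait = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(net, train[0], train[1], lr)
        err = sse(net, valid[0], valid[1])
        if err < best_err:
            best_err, best_epoch, wait = err, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_err, best_epoch

def make_data(n, rng):
    """Synthetic non-linear data with targets inside [0.1, 0.9]."""
    X = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n)]
    y = [0.5 + 0.4 * math.sin(2 * x[0]) * x[1] for x in X]
    return X, y

data_rng = random.Random(1)
train_set, valid_set = make_data(40, data_rng), make_data(15, data_rng)

# Try 1, 2, ..., 15 hidden neurons and keep the topology with the lowest validation SSE
results = {h: train_early_stopping(h, train_set, valid_set) for h in range(1, 16)}
best_h = min(results, key=lambda h: results[h][0])
print(best_h, results[best_h])
```

The selected number of hidden neurons depends on the data and on the random initialization, so this sketch is not expected to reproduce the values of 8, 11 and 15 reported above.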
The models established by neural networks allow improvements of up to 82% in the explanation of variance compared with those found by multiple linear regression: 77% for manganese, 82% for copper and 59% for zinc. The coefficients of determination increase respectively from 0.224 to 0.99, from 0.172 to 0.99 and from 0.40 to 0.99. The results in Table 2 show that the models established by the ANN are better than those found by the MLR method. Indeed, the coefficients of determination calculated for the ANN models are significantly higher (above 0.9), whereas those calculated for the MLR models are lower (between 0.17 and 0.40). This can be easily seen by referring to Fig. 3 and Table 2.
Table 2. Coefficients of determination between the observed heavy metal contents and those predicted by the MLR and the ANN.
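The improvement percentages quoted in this section are consistent with the difference between the ANN and MLR coefficients of determination, expressed in percentage points. The short check below assumes this reading, which the text does not state explicitly; the dictionary layout is only for illustration.

```python
# (R² with MLR, R² with ANN) for each metal, as quoted in the text
r2 = {"Mn": (0.224, 0.99), "Cu": (0.172, 0.99), "Zn": (0.40, 0.99)}

# Gain in explained variance, in percentage points
gains = {metal: round((ann - mlr) * 100) for metal, (mlr, ann) in r2.items()}
print(gains)  # {'Mn': 77, 'Cu': 82, 'Zn': 59}
```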
The determination coefficients found when checking the validity of the models developed by the ANN are close to those obtained during learning. On the other hand, the determination coefficients obtained when testing the validity of the MLR models are somewhat different from those obtained during training. This shows that the concentrations of heavy metals in the waters of the Oued Inaouen catchment area are linked to the physico-chemical characteristics of the environment by non-linear relations, as is commonly found in the aquatic environment [8], [9].
Fig. 2 shows the ANN architecture we used for copper prediction. It is a network of topology [16-8-1].
Fig. 3.b shows the relationships between observed and estimated heavy metal contents for the models developed by the ANN and MLR methods. They illustrate the performance of the ANN method, reflected in the high coefficient of determination of 0.99 observed for manganese and for the two other metals (copper and zinc). Comparing the models established by the MLR with those found by the ANN, we can see that the latter are the most efficient. This can easily be seen by referring to Fig. 3.
The study of the prediction of water-soluble heavy metal concentrations from physico-chemical parameters revealed that the predictive models developed by the current method, which is founded on the ANN technique, are more efficient than those found by the MLR method. This performance appears to result from the fact that the concentrations of heavy metals in the waters of the Oued Inaouen catchment are related to the physico-chemical characteristics of the water by non-linear relations. This leads us to consider, in the future, developing further aspects of this work.
We propose to predict heavy metal contents by including other parameters in our database and to work with different types of neural networks, such as radial basis function (RBF) networks, in order to generalize these forecasting models to different Moroccan aquatic environments.