Determination of soil-water retention curve: an artificial intelligence-based approach

. Soil Water Retention Curve (SWRC) is a fundamental relationship in unsaturated soil mechanics, knowledge of which is essential for determining major mechanical and hydraulic properties of unsaturated soils. There are several empirical, semi-empirical and physically-based models which have been proposed to date for estimating SWRC. While the physically-based models which employ the basic soil characteristics such as grain-size and pore-size distributions are regarded superior to purely empirical models, their Achilles’ heel is the several simplifying assumptions based on which these models are constructed, thereby, restricting their applications and influencing their accuracy. Given the complexity of the soil porous structure, one may resort to the new inference techniques rather than mechanistic modelling to find the relationship among soil physical characteristics and the retention properties. Therefore, an alternative approach to purely empirical relationships as well as physically-based and conceptual models for determining SWRC is the use of Artificial Intelligence (AI) based techniques to acquire a relationship for SWRC based on the soil basic properties, especially grain size distribution and porosity. Among AI-based methods, Multi-Gene Genetic Programming (MGGP), often used to establish a close form equation for a complex physical system, offers a suitable alternative to the current approaches. In this study, a database compromising of 437 soils (containing various soil types, namely, sand, clay, silt, loam, silt loam, clay loam, sandy loam, sandy clay loam, silty clay loam, silty clay, and loamy sand soils) was used along with MGGP to establish a relationship among suction, saturation, porosity and grain size distribution. The proposed equation had a reasonable agreement with the experimental data compared to the other grain-based and


Introduction
Often used to establish equations for the major mechanical and hydraulic characteristics of unsaturated soil such as permeability, shear strength, and volume change, soil water retention curve (SWRC) plays an indispensable role in the mechanics of unsaturated soil, knowledge of which is required, as a core element, for any hydromechanical model. Soil water retention is determined experimentally through various methods such as pressure plate, Buchner funnel, tensiometers, and pressure membranes. Laboratory measurements of the soil water retention curve are, however, time-consuming, costly and laborious, particularly for fine-grained soils. That is why in the last couple of decades, several studies have been performed to present various techniques to simulate, estimate and predict soil water retention curve. The numerical methods used either to estimate or predict SWRC can be grouped into the following categories [1]: 1. Empirical equations whose parameters are found by fitting to the experimentally measured values, e.g., [2,3]. 2. Correlation of the parameters of the equations of the first category to the soil physical characteristics such as dry density, and soil particle size distribution, e.g., [4,5]. 3. Physically-based techniques which utilize a conceptual model such as the bundle of capillary tube to determine pore size distribution based on the grain size distribution, analytically and then formulate the soil water retention curve. 4. Simulation techniques such as discrete element method (DEM) and pore network modelling (PNM) [6][7][8]. 5. Water content at different suction values is correlated to various soil characteristics such as porosity, and an index representing grain size distribution (e.g., effective diameter, D 10 ); the correlation is obtained, often, by regression techniques [9,10].
6. Various Artificial Intelligence methods (AI) such as neural networks, evolutionary polynomial regression (EPR), and genetic programming have been resorted to, to find the equations of soil water retention curve [11][12].
All of the techniques suffer from certain limitations which have raised the need to perform the current study. In fact, the first category of the techniques requires the experimental datapoints. In the second and fifth category of the studies, mostly a certain mathematical form for the equation of soil water retention has been assumed whose parameters are, then, found through the regression techniques. One major factor is the number of soils used for the regression which can limit the generalization of the obtained equations for different soils. Moreover, one has to assume the mathematical form a priori.
In the third category of studies, the conceptual model employed to find the relationships often includes very simplifying assumptions. While the problems pertinent to the use of bundle of capillary conceptualization (used in the 3 rd category of studies) such as lack of connectivity between the tubes have been dealt with and alleviated by using simulation techniques such as discrete element and pore network modelling (fourth category of the methods), models in the fourth category of methods are commonly computationally demanding and are not straight forward as the use of a closed form equation. Furthermore, the direct use of pore network models needs assumptions about pore structure such as pore size distribution and throat size distribution which are often obtained from imaging techniques, that are not always available. The use of discrete element for soil water retention determination requires a pore network extraction step in which tessellation techniques are often employed. Tessellation techniques, however, in some cases result in dividing large pores into many small pores creating unrealistic throats.
While various artificial intelligence-based techniques have, therefore, been used to address the aforementioned shortcomings, the methods such as neural network often do not provide the user with an explicit equation. In recent years, genetic programming, particularly, an advanced form of it, the so-called Multi-gene genetic programming (MGGP) has been emerged in the computer science which can be considered as a powerful tool to extract the relationship among various variables in a complex system. Given the complexity and the number of factors affecting the soil water retention curve, one option could be the use of MGGP where no mathematical expression is assumed beforehand to obtain the relationship between the soil water retention and soil physical characteristics such as porosity and grain size distribution.
Therefore, the main purpose of this paper is to establish an AI-based explicit pedo-transfer function for SWRC prediction by using the Multi-Gene Genetic Programming algorithm (MGGP) utilizing a relatively comprehensive database. In order to make the spectrum of the soils to which the obtained equation can be applied as wide as possible, several soils whose hydraulic data as well as grain size distribution were present in SoilVision (2002) database [13] were used (437 soils). Eventually, an equation relating water content to suction based on the soil grain size distribution and porosity was obtained. The equation was then compared to the other typical physico-empirical equations [14].

Materials and methods
In order to use MGGP, a database consisting various soil types was compiled. The next sections describe the database, the MGGP approach, and then the selected physico-empirical techniques used for comparison, respectively and finally the model performance criteria are presented.

Database collection and variable selection
It is now well established that SWRC is dependent on various soil physical characteristics, such as grain-size distribution, porosity, and water content. Hence, suitable variables representing these characteristics had to be selected for MGGP. Furthermore, such data should be readily available from laboratory works reported the literature or from present databases. In order to represent grain size distribution, five grain size diameters (D 10 , D 20 , D 30 , D 50 , and D 60 ) were considered. The reason to select these diameters was their presence for all 437 soils selected from the Soil Vision Database [13]. These effective diameters along with porosity, as well as suction were input variables for the MGGP, and the volumetric water content reported at each suction was considered as the output. The compiled database comprised of a wide spectrum of soil types (i.e., sand, clay, silt, loam, silt loam, clay loam, sandy loam, sandy clay loam, silty clay loam, silty clay, and loamy sand).
In order to prepare data for AI-based model, 75% of the database is chosen as the training dataset, and 25% is the testing data that is selected randomly from this database. The training samples are used for formulating the model, and the test data samples are used for authenticating the extrapolation ability of the model. The descriptive statistics of the input and output variables are presented in Table 1.

Multi-Gene Genetic Programming and its implementation
MGGP is an artificial intelligence-based technique which is tailored to obtain explicit formula through solving symbolic regression problems and is based on the same principles of Genetic Programming (GP). GP is known as one of the evolutionary programming algorithms, inspired by the Darwin's theory of "survival of the fittest"; it employs the elements of the evolution process in nature in order to find the optimal solution [15]. GP functions the same as Genetic Algorithm (GA); however, the apparent difference is that GA is utilized for parameter optimization, while the GP is developing a tree-based structure in various lengths denoting a certain mathematical equation on which the structural optimization approach is applied to obtain the best explicit symbolic form. At first, GP algorithm generates the initial population of solutions at random. This process is performed by randomly combining the elements from function and terminal set. The function set elements can consist of arithmetic operators, nonlinear functions, and Boolean operators. The terminal set is composed of two input process variables and random constants. Composing the function and terminal sets leads to a tree-shape exhaustive set of solutions [16].
In order to find the best individuals of the generated population, the performance of the GP model must be evaluated. It is measured using the fitness function which can be various error measures such as RMSE, in order to compare the predicted GP model values to that of the actual values. The fitness function should be minimized to attain the best performance, considering the train and test results. This process has a termination criterion that controls the procedure of performance evaluation. The termination criterion is defined by the user, and it can be any of the maximum number of generations, a limitation on run time, or an error threshold for the fitness function. Until the termination criterion is met, the genetic operators are implemented on the selected individuals of the initial population with the aim of evolving a new population. The genetic operators include processes such as crossover, mutation, and reproduction. They aim to increase the diversity of the new population and consequently increasing the chance of capturing the global optimum and the best possible structure. The tournament selection method is employed to select the individuals for applying the genetic operators. It chooses the individuals with the lowest fitness values and different size to reproduce or copy to the next generation. In the crossover operation, two individuals are selected randomly, and a subtree branch of each one is swapping to produce the new children. With this approach, the mutation operation is replacing the subtree branch of an individual with the new randomly generated GP model to create a new child. The intensity of the genetic operations depends on their probability values, reported in Table 2.
GP and MGGP have the same procedure in finding an explicit model; however, the prominent feature of MGGP against GP is that the MGGP model participating in the evolution process is composed of several trees/genes where genes are randomly selected and combined whose coefficient of participation in the combination (weights) are found via the least square method. In this research, the open source MGGP toolbox, called, GPTIPS 2.0 [17] was employed. The best MGGP model was selected based on training data. The maximum number of depths, genes, population size, generations, and other MGGP input parameters are reported in Table 2.

Grain-size based physical models: Arya and Paris Model
The Arya and Paris (1981) model conceptualizes each particle size fraction i with a single capillary tube of radius r i . For each capillary tube, its radius, is obtained from the following [14,18]: where, R i is the representative particle size in i th particle size class, M i is mass fraction of the i th particle size class, ε is porosity, and ρ s is the density of the solid particles.
The exponent α is a semi empirical coefficient [14]. The constant values of this coefficient for different soils introduced in [19], and implemented in soil vision software were used in this study.
The matric potential, ψ i , is then obtained based on the capillary rise (Young-Laplace) equation as follows: where, σ is the surface tension, β is wetting angle of water menisci, ρ w is water density, and g is the gravitational acceleration. The volumetric water content, θ i , for the matric potential ψ i is obtained from summing up the mass fractions: where, θ s is the saturated water content. It should be noted that the index j denotes the particle size fractions associated with the water-filled pores and i corresponds to all the particle size fractions.

Measures of the model performance
The measures of performance utilized for the model selection were mean absolute percentage error (MAPE), root mean squared error (RMSE), square of correlation coefficient (R 2 ), multi-objective error (MO), sum squared error (SSE), mean squared error (MSE) as described hereafter: where, M i is predicted and A i is actual values, i M and i A are the average predicted and actual values respectively and n is number of data samples.

Summary and the concluding remarks
In this study, an advanced tool from artificial intelligence-based techniques was utilized to obtain a new pedo-transfer function for soil water retention. For this purpose, a database consisting of 437 soils was compiled and used. The new pedo-transfer function had a reasonably well agreement with the experimental datapoints and had a better performance as compared to Arya and Paris (1981), a typical model from the physically-based techniques. Although the obtained equation seems to be long and complex, it is a closed form equation which can easily be implemented in any excel worksheet or software to obtain the soil water retention curve. The sole input parameters are the five grain size diameters obtained from the grain size distribution curve and porosity. Ongoing study of the authors is devoted to enhance the presented equation for the fine-grained soils by considering additional variables.