Modeling and forecasting tasks of agriculture based on machine learning

Continuous advances in computer technology have provided good support for extending agricultural research with machine learning. This article addresses the problem of yield forecasting using machine learning methods and algorithms to support management decision-making in the agricultural sector. For a data set collected from five districts of the Issyk-Kul region, covering weather conditions, soil characteristics and pre-processing of the sowing area, a study of the yield of various crops is demonstrated using advanced machine learning algorithms such as support vector regression, k-nearest neighbors, variants of gradient boosting, and random forest. To assess the accuracy of the models, a comparative analysis with the results of multiple regression was carried out. It is shown that powerful regression algorithms such as k-nearest neighbors (KNN), random forest (RF), support vector regression (SVR) and gradient boosting (GBR) give tangible results in prediction compared to other machine learning methods (MAPE = 10%). The calculation results showed the effectiveness of ensemble methods for solving yield forecasting problems, and that environmental factors (weather conditions) have a greater impact on yield than soil genotype.


Introduction
Food security plays a key role worldwide and is an important lever in the success of a country's economy as a whole. Its main component is the yield of crops. Yield is among the more complex quantities to model and forecast in agricultural projects: it is a trait that, in general, can be defined as an amalgamation of many factors of nature and of the environment where a particular crop is grown, and it requires comprehensive research. Accurate crop yield prediction requires a fundamental understanding of the functional relationship between crop yield and factors related directly to specific natural phenomena, such as temperature, humidity and precipitation, as well as factors affecting yields indirectly. Indirect factors include the characteristics of the sowing areas, the quality and composition of the soil treated with fertilizers, and the results of seed studies of plant genotype for the region under study. At the same time, identifying these relationships requires reliable data sets as well as powerful algorithms for modeling and predicting such processes.
To build a yield model, it is also necessary to establish other relationships, for example with climate change, which currently affects many regions of Kyrgyzstan. The dry year of 2021 brought a shortage of irrigation water and abnormally hot days to many regions of the country. In the Issyk-Kul region, however, the period from August 15 to September 15, 2021 saw incessant heavy rains, high ambient humidity and favorable temperature conditions; this coincided with the period of fruit bearing and ripening and of rapid plant growth, and greatly affected the yield of crops, especially potatoes and fruit trees. At the same time, along with the quantity of agricultural products, the quality of products obtained from these areas improved greatly. This natural factor suggests that weather conditions, rather than genotype or soil characteristics, play the decisive role for yield.
In this direction, [1] considered the application of k-nearest neighbors and decision trees to wheat yields. In [2], the prediction problem was considered by applying naive Bayes and k-nearest neighbor algorithms to data from companies developing new crop varieties. In [3], the forecasting problem was solved with the random forest algorithm and linear regression, taking regional and global features into account when choosing which plant species to grow. Rice yield was studied in [4] using the support vector machine, and forecasting results accounting for the topographic features [5] of an area with subtropical climate were obtained with decision tree, logistic regression and k-nearest neighbor algorithms. In [6], the classification formulation of the yield problem was investigated using naive Bayes, random forest, neural networks, decision trees and support vector machines. In [7][8], yield prediction was investigated using neural network analysis of fruit images; specifically, the recognition problem was solved using image segmentation for fruit detection and yield estimation in apple orchards. The study [9] showed that proper agronomic farming with the use of fertilizers, continuous observation and timely identification of the needs of agricultural plants plays a key role in obtaining the desired yield.
Research in recent years shows that ensemble learning is attractive because ensemble methods construct a set of hypotheses and combine them, and they have been applied to many practical problems, including yield prediction [10,11,12,13].
Many factors affect crop yields: the area planted with proper pre-tillage, the effective use and improvement of irrigation systems, and changes in weather, rainfall and temperature patterns. Taking the above conditions into account, this paper investigates crop yields using machine learning methods and algorithms.
The main emphasis was placed on strong factors affecting yields, such as tillage and basic weather conditions. These factors were taken into account individually and regionally when collecting the data. When forming the database, the soil characteristics of the regions under study were also examined: in our case, soil acidity and composition were recorded separately for the northern coast and the eastern and southern parts of the Issyk-Kul region. When creating forecasts, the scale of the main agricultural regions of the study area was taken into account.
Methods
When solving the problem of yield forecasting, the most commonly used algorithms were considered.
Logistic regression (LR). As one of the most widely used classification algorithms, logistic regression shows satisfactory performance at relatively low computational cost. The logistic function for logistic regression with several variables can be written as follows (1):

p(x) = \frac{1}{1 + e^{-(\omega_0 + \sum_{j=1}^{m} \omega_j x_j)}}. (1)

For splitting a nonlinear decision boundary in classification problems a kernel function is used [14], for example the radial kernel (2):

K(x_i, x_{i'}) = \exp\left(-\lambda \sum_{j=1}^{m} (x_{ij} - x_{i'j})^2\right), (2)

where x_{ij} and x_{i'j} are the i-th pair of observations of the j-th feature, m is the number of features, λ is a tuning parameter that accounts for the smoothness of the decision boundary, and K(·,·) is a kernel function.
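As an illustration, the logistic function and a radial kernel of the form described above can be sketched directly in NumPy. This is a minimal sketch: the function names and the evaluation points are ours, not taken from the study.

```python
import numpy as np

def logistic(x, beta0, beta):
    """Logistic function for several variables: p = 1 / (1 + exp(-(b0 + b . x)))."""
    return 1.0 / (1.0 + np.exp(-(beta0 + np.dot(beta, x))))

def rbf_kernel(xi, xk, lam=1.0):
    """Radial kernel K(x_i, x_i') = exp(-lam * sum_j (x_ij - x_i'j)^2)."""
    return np.exp(-lam * np.sum((xi - xk) ** 2))

x = np.array([0.0, 0.0])
print(logistic(x, 0.0, np.array([1.0, 1.0])))  # 0.5 at the decision boundary
print(rbf_kernel(x, x))                        # 1.0 for identical points
```

A smaller λ makes the kernel, and hence the decision boundary, smoother; a larger λ lets it bend more sharply around individual observations.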
Random Forest (RF). One of the most advanced and powerful machine learning algorithms, random forest is an ensemble-learning-based classification algorithm. It uses multiple decision trees to perform the classification task: all constituent decision trees are weak learners, their outputs are combined, and the final prediction is made by voting over all of these classifiers. Random forest is essentially a tree-based method that achieves robust prediction performance by combining a large number of decision trees into a single consistent prediction. The main feature of the random forest is that each split of a tree is not allowed to consider most of the available features [15].
Note that the resulting ensemble classifier a(x) is defined as (3):

a(x) = \frac{1}{N} \sum_{i=1}^{N} b_i(x), (3)

where the classifiers b_i(x) are decision trees, each built on a bootstrap sample of the training data. In classification problems the solution is chosen by majority vote of the b_i(x); in regression problems their mean is taken, as in (3). It is recommended to take m = \sqrt{n} features at each split in classification tasks and m = n/3 in regression tasks, where n is the number of features.
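The heuristic of limiting the number of features considered at each split can be sketched with scikit-learn's RandomForestRegressor. The data below are a synthetic stand-in for the yield table (the feature set and target function are invented for illustration, not the study's actual data).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: columns imitate temperature, humidity, rainfall, N, P, K
X = rng.uniform(size=(300, 6))
y = 20 + 5 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_features as a fraction: roughly n/3 of the features are candidates per split,
# the common recommendation for regression forests
rfr = RandomForestRegressor(n_estimators=200, max_features=1 / 3, random_state=0)
rfr.fit(X_tr, y_tr)
print(round(rfr.score(X_te, y_te), 3))  # R^2 on held-out data
```

Restricting features per split decorrelates the trees, which is what makes averaging them in (3)-style ensembles effective.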
Gradient Boosting (GBR). As a branch of ensemble methods, boosting combines the performance of a number of weak classifiers into a powerful "voting classifier", so it is considered a strong classifier. As an example, consider the prediction (4) given by Discrete AdaBoost:

F(x) = \operatorname{sign}\left(\sum_{m=1}^{M} c_m f_m(x)\right), (4)

where f_m(x) are the weak classifiers, each giving either a positive or a negative prediction, c_m are coefficients calculated from the training weights, M is the number of weak classifiers, the sign function returns either a positive or a negative prediction, and F(x) is the resulting prediction. By combining multiple models, boosting can provide better prediction performance than a single model. The basic boosting procedure is to fit a sequence of weak learners (e.g., discriminant analysis, k-nearest neighbors, decision trees) to weighted versions of the training data. In this paper, three popular boosting algorithms, namely Discrete AdaBoost, LogitBoost and Gentle AdaBoost, are chosen to investigate the problem.

The data were extracted from sources in five districts of the Issyk-Kul region of the Kyrgyz Republic and comprise, for each district, the fertilizers applied (nitrogen, phosphorus and potassium), cultivated field areas, temperature, humidity, rainfall, soil acidity, four soil types (two districts share the same soil data), and potato yields. For the convenience of working with Python libraries, the peculiarities of each district's data were taken into account and the data were combined into a single .csv file. Pre-processing of the yield data showed that all factors obey the normal distribution law. The correlation matrix of the database under study, without dummy factors, is shown in Figure 1. It shows that the dependent variable, yield, does not correlate strongly with the independent variables.
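The Discrete AdaBoost combination rule (4) can be illustrated with a minimal numerical sketch. The decision stumps and coefficients below are hand-picked for illustration; in a fitted model the c_m would be computed from the weighted training error.

```python
import numpy as np

def stump(threshold, feature):
    # Weak classifier f_m: +1 if x[feature] > threshold, else -1
    return lambda x: 1.0 if x[feature] > threshold else -1.0

weak = [stump(0.3, 0), stump(0.6, 1), stump(0.5, 0)]  # f_1..f_M
coef = [0.9, 0.5, 0.4]                                # c_1..c_M (illustrative values)

def F(x):
    # F(x) = sign(sum_m c_m * f_m(x)), the rule in (4)
    return np.sign(sum(c * f(x) for c, f in zip(coef, weak)))

print(F(np.array([0.7, 0.8])))  # all three stumps vote +1 -> 1.0
```

Even when individual stumps are barely better than chance, the weighted vote F(x) can be substantially more accurate, which is the point of boosting.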
The dependent variable yield is initially represented as a multiple regression model (5):

Y = \omega_0 + \sum_{j=1}^{m} \omega_j x_j + \varepsilon, (5)

where Y is the dependent yield variable, x_j are the independent variables composing the multiple regression, \omega_j are unknown coefficients, and ε is the model error. The unknown coefficients \omega in (5) can be determined analytically from the minimum of the following quadratic functional (6):

Q(\omega) = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2, (6)

where \hat{y}_i is the predicted value for the exact value y_i. From the minimum condition \partial Q / \partial \omega = 0 the vector of coefficients (7) is obtained:

\omega = (X^{T}X)^{-1} X^{T} Y. (7)

However, algorithm (7) is unstable to small changes in the input data, i.e. the problem is ill-posed. A model fitted by this algorithm will be overfitted: it begins to interpolate the data instead of extrapolating and loses its generalizing ability, i.e. it will not generalize to the test data. To solve this problem, a regularizing function R(\omega) is introduced [16]. The problem of determining the weighting coefficients \omega is then formulated with the transformed error functional (8):

Q_\lambda(\omega) = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 + \lambda R(\omega), (8)

where λ is called the regularization coefficient. With R(\omega) = \lVert \omega \rVert^2, minimizing (8) leads to (9):

\omega = (X^{T}X + \lambda E)^{-1} X^{T} Y, (9)

where E is the identity matrix.
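The closed-form solutions (7) and (9) can be checked numerically. The data below are synthetic, and λ = 0.1 is an arbitrary illustrative choice; the point is only that the regularized formula differs from ordinary least squares by the λE term.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

lam = 0.1
E = np.eye(X.shape[1])

# Unregularized least squares (7): w = (X^T X)^{-1} X^T y
w_ols = np.linalg.inv(X.T @ X) @ X.T @ y
# Regularized (ridge) solution (9): w = (X^T X + lam*E)^{-1} X^T y
w_ridge = np.linalg.inv(X.T @ X + lam * E) @ X.T @ y

print(np.round(w_ridge, 2))
```

For well-conditioned X the two solutions nearly coincide; for nearly collinear features (7) blows up while (9) stays stable, which is exactly the ill-posedness the regularization addresses.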

Discussion
In the calculations below, this regularization technique was applied wherever models were overfitted. Here are the results of applying machine learning algorithms, as components of an ensemble, to the study of regression models, together with their accuracy estimates for yield prediction. The resulting yield regression results are shown in Table 1, and the yield calculations are visualized in Figures 2-3. Table 1 shows that gradient boosting with MAPE = 10.14% and random forest with MAPE = 10.24% gave better results than the other algorithms, while the overall leader was support vector regression with MAPE = 10.12%. Below are the results of building models with eight machine learning algorithms for the yield regression problem; the most proven regression algorithms were used. The functions mean_squared_error, r2_score, mean_absolute_error and max_error from sklearn.metrics were used to evaluate the models. We relied on the following estimators to analyze the regression models: lin = LinearRegression(), dtr = DecisionTreeRegressor(), sgd = SGDRegressor(loss='squared_loss'), gbr = GradientBoostingRegressor(), knn = KNeighborsRegressor(n_neighbors=5), rfr = RandomForestRegressor(), svr = SVR(), and xgb = XGBRegressor(). The evaluation results for the models built with these algorithms are presented in Table 1. The four algorithms random forest, XGBoost, k-nearest neighbors and support vector regression showed a mean absolute percentage error of about MAPE = 10% deviation from the exact prediction. A visualization of the prediction data in the form of regressions follows.
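A comparison of this kind can be sketched as follows with scikit-learn, on synthetic stand-in data rather than the Issyk-Kul database; the feature count, target function and the subset of models shown are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(size=(400, 6))  # stand-ins for weather/soil/fertilizer features
y = 15 + 10 * X[:, 0] - 6 * X[:, 1] ** 2 + rng.normal(scale=0.8, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "lin": LinearRegression(),
    "gbr": GradientBoostingRegressor(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "rfr": RandomForestRegressor(random_state=0),
    "svr": SVR(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mape = 100 * np.mean(np.abs((y_te - pred) / y_te))
    print(f"{name}: MAPE={mape:.2f}%  MAE={mean_absolute_error(y_te, pred):.2f}  "
          f"R2={r2_score(y_te, pred):.3f}")
```

On the paper's real data the same loop, extended with the remaining estimators, would produce the per-model MAPE/MAE/R² figures summarized in Table 1.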
Figure 3 shows the results of applying the following algorithms to the yield regression problem: random forest, k-nearest neighbors with five nearest neighbors, XGBoost, and support vector regression.
As can be seen from the results, the data are most densely arranged around the diagonal for stochastic gradient descent, random forest, gradient boosting and support vector regression. For the neg_mean_absolute_error (NMAE) score, values closer to zero are better, and a perfect model has NMAE = 0. The estimate of the mean absolute deviation (MAE) in all of our cases is around 2. Shown below are some calculated model performance estimates for neg_mean_absolute_error with five-fold cross-validation (cv=5). In this case, an ensemble can give a tangible improvement in prediction. Next, we used the LazyPredict library ("lazy prediction") for the task of predicting crop yields with multiple ensemble components; it should be noted that the constituent algorithms can be arbitrary. Using this library with a special selection of machine learning parameters, we calculated the performance of the ten most powerful standard classification or regression models on several performance metrics. The ensemble calculations for the regression problems below were computed in Google Colaboratory.
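The neg_mean_absolute_error evaluation can be reproduced with scikit-learn's cross_val_score. The data here are synthetic, and five folds are assumed, matching the setup described in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.uniform(size=(200, 5))
y = 30 + 4 * X[:, 0] + rng.normal(scale=0.3, size=200)

# scoring='neg_mean_absolute_error' returns -MAE, so values closer to zero are better
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=5, scoring="neg_mean_absolute_error")
print(np.round(scores.mean(), 3))
```

scikit-learn negates the MAE so that all its scorers follow the same "greater is better" convention; the reported values are therefore non-positive, with 0 the ideal.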
As the next task, the number of dummy features was expanded to study the soil characteristics of these five districts: the new updated database includes extended district soil data. Four cases of the ensemble algorithm were considered. The following ensembles of algorithms were implemented to analyze the voting algorithm, and the VotingRegressor results are shown in Table 2. Except for the last case, which has more constituent algorithms in the voting ensemble, the results turned out to be close to each other. Using this database for advisory purposes, a web-based system has been created for farmers, which predicts from climatic data and the relevant soil characteristics of the area what to grow and which crops are essential for the region. The results showed that machine learning algorithms, given test data from a particular area specifying the relevant parameters of the region, give an appropriate prediction. Two machine learning algorithms, gradient boosting (XGBoost) and random forest, performed well in these tests. For example, test data for the selected area were used in which the NumPy arrays contain data reflecting the weather conditions and soil characteristics of the area; in both cases the calculation produced the forecast to plant potatoes (potato). Table 3 shows the results of calculations using the two algorithms for this problem. As can be seen from Figure 4, the random forest algorithm gave the highest prediction result, 0.990909 (99.09%).
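A minimal sketch of a VotingRegressor ensemble combining random forest, AdaBoost and gradient boosting, the constituents named for Table 2, on synthetic data (the real extended soil database is not reproduced here):

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 6))
y = 25 + 8 * X[:, 2] - 5 * X[:, 4] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# VotingRegressor averages the predictions of its fitted constituent estimators
voter = VotingRegressor([
    ("rfr", RandomForestRegressor(random_state=0)),
    ("ada", AdaBoostRegressor(random_state=0)),
    ("gbr", GradientBoostingRegressor(random_state=0)),
])
voter.fit(X_tr, y_tr)
print(round(voter.score(X_te, y_te), 3))  # R^2 of the averaged prediction
```

Because the averaged prediction smooths out the individual estimators' errors, voting ensembles of this kind tend to land close to their best constituent, consistent with the closeness of the Table 2 results.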
Conclusion
The paper shows the effectiveness of using machine learning algorithms for modeling and forecasting agricultural problems. Research on data from five districts of the Issyk-Kul region on the yield of potatoes and other crops, using machine learning, showed the effectiveness of identifying the main characteristics for building models and forecasts. The databases collected on the yield of potatoes and other plants cover about 5 thousand subjects (farmers) in the Issyk-Kul region and take into account important factors of the weather conditions in these districts, as well as indicators of soil acidity, which in the selected districts ranges mainly from 7 to 7.5, and other soil characteristics. The temperature regime of these five districts, due to the influence of Lake Issyk-Kul, fluctuates around the average monthly temperature by about ±4-5 degrees. The precipitation and humidity data were taken from the meteorological observations of the republic. Calculations based on machine learning algorithms gave the deviations between exact and predicted data reported in Table 1. Figures 2-3 show the results of applying machine learning algorithms to the regression problems as graphical representations of the potato yield indicator for the region as a whole, with different accuracies. Figure 4 shows the performance of the XGBoost algorithm model. The best yield prediction performance, a mean absolute percentage error MAPE = 10.2%, was obtained for the random forest algorithm; in this case the density of data around the regression line, as seen in Figures 2-3, is the greatest. Other algorithms, such as support vector regression, gradient boosting and k-nearest neighbors, are close to the random forest result in MAPE. The paper also studied ensembles of algorithms using the LazyPredict library.
Table 2 shows the results of applying the ensemble method to the yield prediction problem; the top ten models were selected and their results obtained. Tables 3-4 show the results of applying machine learning algorithms to identify the most suitable crops for given test data, according to the localization of the crop area and soil properties. Evaluation results are given for VotingRegressor ensembles built from Random Forest, AdaBoost and GradientBoostingRegressor on the extended database that takes into account the localization of areas and soil characteristics.
For the data set collected from five districts of the Issyk-Kul region, covering weather conditions, soil characteristics and pre-treatment of the sowing area, this work studied the task of predicting the yield of various crops using advanced machine learning algorithms: support vector regression, k-nearest neighbors, variants of gradient boosting, and random forest. For comparative analysis, model accuracy estimates were compared with multiple regression results. The effectiveness of the algorithms for yield forecasting problems is shown, and the calculations demonstrated the efficiency of ensemble methods; at the same time, the requirements for parameter selection were removed for the set of constituent ensemble algorithms. This study also showed that the developed models can be successfully used to predict the yield of multiple plants. What has not been realized is the application of deep neural networks to this particular case. The accuracy of the raw data and the choice of the data collected should reflect the environment of the crop being grown as fully and realistically as possible, which in general is a costly way of collecting big data. The convenience of machine learning lies in identifying hidden, complex and non-linear features of the factors affecting yields in the selected region. Further research for the selected regions involves collecting data with deeper factors, such as minimum and maximum temperature, daylight hours, soil characteristics and composition, and data from seed companies creating new crop varieties, to predict future yields. The issues of climate change, the risks associated with debris flows, hail warnings through weather forecasts (phenomena that are not uncommon in this region), increased solar activity, the duration of abnormally hot days and low water levels remain untouched.
All of these factors lead to soil erosion and weathering. According to the authors of [17,18,19,20,21], future studies should use deep learning technologies for big data under local conditions. In general, the calculations and results obtained in this work showed the completeness of research using machine learning algorithms for the tasks of modeling and predicting yields with region-specific data.