A New Nonparametric Algorithm for Preprocessing Stochastic Data with Uncertainty

The article deals with the problem of modeling stochastic processes under uncertainty. The peculiarity of the processes under consideration is that the researcher has no information about the mathematical structure of the object; the object is represented as a black box. The article proposes a nonparametric modeling algorithm based on a nonparametric estimate of the regression function from observations. To improve modeling accuracy, an algorithm for generating training samples is used. The algorithm differs from the previous modification in how essential variables are determined. The results of computational experiments show the effectiveness of the proposed algorithms.


Introduction
The problem of data analysis and modeling of stochastic processes is relevant to many subject areas, including the processing of geoinformation data [1]. When the mathematical structure of the object under study is unknown, one solution is to use nonparametric estimates [2]. As is known, the accuracy of nonparametric models depends on many factors, one of the key ones being the quality of the initial data [3]. The original data may contain outliers caused by measurement errors, as well as missing values caused by the different sampling rates of the measured variables [4]. These features can degrade the quality of modeling and, in some cases, lead to unsatisfactory results. Another complication is an uneven, sparse distribution of the sample elements in the space of input and output variables [5]. In this case, resampling algorithms [6] or bootstrap methods [7] can be used. If the sample size is small while the dimension of the space is large, methods for generating training samples are proposed instead. However, the previously proposed methods for preprocessing sparse samples cannot be applied to noisy data (in the presence of outliers), nor when there are many variables, including insignificant ones. Modifying the existing algorithms for obtaining training samples will broaden the conditions of their applicability and improve the accuracy of modeling.
The rest of the article is organized as follows. Section 2 presents the problem statement, Section 3 describes the proposed algorithm, Section 4 analyzes the results of our simulation study, and Section 5 presents the discussion and conclusions.

Problem statement
Consider the proposed process modeling scheme presented in Figure 1 [8]. The following designations are used: x(t) is the output process variable, and u(t) is the vector of input actions, which can be controlled and measured. The random action is ξ(t); the control (measurement) blocks of the input and output variables are G_u and G_x. The random noises g_x(t) and g_u(t) have zero mathematical expectation and bounded variance. Measurements of the input and output variables form a sample of observations {u_i, x_i, i = 1, …, s}, which is passed to the data analysis block. This module contains the data preprocessing algorithms used to generate the training sample.
The modeling task is not only to obtain an estimate \hat{x}(t) of the output process variable x(t), but also to preprocess the initial data.

A new modification of the nonparametric algorithm for data preprocessing
As a model, we use the well-known nonparametric Nadaraya-Watson (NW) estimator [9,10]:

$$\hat{x}_s(u) = \frac{\sum_{i=1}^{s} x_i \prod_{j=1}^{m} \Phi\left(\frac{u_j - u_{i,j}}{c_s}\right)}{\sum_{i=1}^{s} \prod_{j=1}^{m} \Phi\left(\frac{u_j - u_{i,j}}{c_s}\right)}, \quad (1)$$

where u = (u_1, u_2, …, u_m), c_s is the bandwidth or smoothing parameter, which may be set individually for each observation, and Φ(·) is a kernel function.
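A minimal sketch of an NW estimator of this form, assuming a Gaussian product kernel and one bandwidth per input variable (both choices are illustrative assumptions, not prescribed by the article):

```python
import numpy as np

def gaussian_kernel(z):
    """Bell-shaped kernel Phi(z) used inside the product kernel."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def nw_estimate(u, U, x, c):
    """Nadaraya-Watson estimate of x at the query point u.

    U : (s, m) array of input observations u_i
    x : (s,)   array of output observations x_i
    c : (m,)   array of bandwidths (one per input variable)
    """
    z = (np.asarray(u, dtype=float) - U) / c
    weights = np.prod(gaussian_kernel(z), axis=1)  # product kernel over the m inputs
    denom = weights.sum()
    if denom == 0.0:
        return np.nan  # no observations fall in the c-neighborhood of u
    return float(np.dot(weights, x) / denom)

# Tiny demo: with symmetric neighbours, the estimate at u = 1.0 is their middle value.
U = np.array([[0.0], [1.0], [2.0]])
x = np.array([0.0, 1.0, 2.0])
x_hat = nw_estimate([1.0], U, x, c=np.array([0.5]))
```

Returning `nan` when all kernel weights vanish makes the local character of the estimate explicit: far from the data, the c_s-neighborhood is empty and no estimate is defined.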
The estimate (1) belongs to the class of local approximation methods: its accuracy depends on the number of observations that fall in the c_s-neighborhood of the point under consideration. An algorithm for generating additional observations was proposed earlier [11]. However, if there are outliers in the sample, its result is unsatisfactory. In addition, in [12] all variables were used to calculate the estimate (1). As is known, including insignificant variables in a model can reduce its accuracy. Therefore, it is proposed to identify the significant variables and use only them, both for calculating the estimate and for generating a new training sample. The paper proposes to modify the algorithm as follows:
1. Identify and treat outliers using a nonparametric method [13].
2. Determine the smoothing parameters c_{s,i}, where the index i is the observation number in (1), by minimizing an optimality criterion. The Nelder-Mead multidimensional optimization method can be used as the optimization algorithm; it works with nonsmooth and noisy functions.
3. Among all the found coefficients, identify the significant (essential) variables; only these are used both in the estimate (1) and when generating the new training sample.

Results of computational experiments
As seen from Table 1, for all sample sizes, the NWN kernel estimator trained on the sample obtained with the proposed data preprocessing algorithm has smaller MAPE values than the NWS kernel estimator. In each case, NWN shows the best performance.
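The bandwidth search in step 2 can be sketched as a Nelder-Mead minimization of a leave-one-out criterion. Both the leave-one-out criterion and the large-bandwidth heuristic for spotting insignificant variables (step 3) are illustrative assumptions here, not the article's exact procedure:

```python
import numpy as np
from scipy.optimize import minimize

def nw_estimate(u, U, x, c):
    """Nadaraya-Watson estimate at point u with per-variable bandwidths c."""
    z = (np.asarray(u, dtype=float) - U) / c
    w = np.prod(np.exp(-0.5 * z ** 2), axis=1)   # Gaussian product kernel
    return np.dot(w, x) / w.sum() if w.sum() > 0.0 else np.nan

def loo_cv(log_c, U, x):
    """Leave-one-out squared error; bandwidths are optimized on a log scale."""
    c = np.exp(log_c)                            # keeps every bandwidth positive
    s = len(x)
    idx = np.arange(s)
    preds = np.array([nw_estimate(U[i], U[idx != i], x[idx != i], c)
                      for i in range(s)])
    ok = np.isfinite(preds)
    if not ok.any():
        return np.inf                            # bandwidths too small: no neighbours
    return np.mean((x[ok] - preds[ok]) ** 2)

# Synthetic data: the second input does not affect the output at all.
rng = np.random.default_rng(1)
U = rng.uniform(0.0, 1.0, size=(40, 2))
x = np.sin(2.0 * np.pi * U[:, 0]) + 0.05 * rng.normal(size=40)

res = minimize(loo_cv, x0=np.zeros(2), args=(U, x), method="Nelder-Mead")
c_opt = np.exp(res.x)
# Heuristic for step 3: a bandwidth much larger than a variable's range flattens
# its kernel factor, hinting that the variable is insignificant.
```

Nelder-Mead suits this criterion because the leave-one-out error is a noisy, nonsmooth function of the bandwidths, exactly the setting the method is noted for in step 2.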

Conclusion
In the paper, we have presented a new modification of the data preprocessing algorithm. The results of the simulation study, performed to evaluate the performance of the kernel estimators considered, show that using training samples obtained with the proposed algorithm improves the accuracy of modeling.