On an algorithm for generating nonlinear regression models and its computer implementation

Abstract. The tasks of statistical processing of experimental data arise in various areas of human activity. In this regard, the development of algorithms and corresponding computer programs that perform such processing and satisfy certain conditions is an important task. The present work is carried out within this field of research: an algorithm for generating a series of nonlinear regression models of a special type is proposed, and a brief description of the corresponding computer program is given. The proposed algorithm is iterative in nature: each subsequent model is built on the basis of the previous one according to certain rules. The process begins with the construction of the simplest, linear model. For all the models obtained, a number of their statistical characteristics are calculated. The corresponding computer program is written in Java; the process of building models is managed in a certain way by the user working with this program. In addition, the paper gives a number of illustrative examples of constructing regression models using the developed program. The analysis of the obtained results shows the advantages of the proposed methodology in comparison with standard linear regression. The proposed procedure, in fact, allows one to pass from an initial linear regression model obtained at the first step of research to another model of better quality, without changing the initial set of empirical data. The availability of many final models makes it possible to select the best variants among them.


Introduction
The tasks of statistical processing of experimental data arise in various fields of human activity. The goals of such studies are to detect hidden patterns, compare possible alternatives in order to determine the best variant, predict the development of events, detect connections between phenomena or processes, etc. There are many computer programs that allow performing statistical analysis of large amounts of data by various methods [1][2][3][4]. These programs (sometimes called statistical packages) can be divided into the following three groups based on their functionality: universal, professional and specialized. Such a variety of statistical packages is due to the diversity of data processing tasks that use statistical procedures of various types. It should be noted that in some areas of activity both the data themselves and the problem statements are so specific that special methods of analysis should be applied to them, which, as a rule, are not presented in universal packages. In this regard, there is often a need to develop algorithms and programs focused on solving applied problems of a special kind.
One of the classical problems of mathematical statistics is the problem of finding an approximate functional dependence

y = F(x1, x2, …, xn) (1)

of one variable y on some other variables x1, x2, …, xn using the known sets of values of all these variables (obtained, for example, as a result of an experiment). In this expression, the function F can be any function of many variables, but its general form is assumed to be known; the function F depends on a number of fitting parameters [5][6][7][8][9][10].
An important problem in such studies is the choice of the function F in equation (1). The simplest choice, often used in this case, is a linear function of the form F = a0 + a1x1 + … + anxn (here the ai are the fitting parameters). This leads to the equation of classical linear regression; to determine the unknown parameters in the corresponding equation, the least squares method is usually used [6,7,11]. If the function F is nonlinear, then we obtain a nonlinear regression equation. Obviously, there are infinitely many ways of choosing the type of the nonlinear function F, and determining the corresponding unknown parameters may be a non-trivial task. However, nonlinear regression models are of undoubted interest, since in many cases they allow describing the studied phenomena more precisely [6,8,[12][13][14][15][16]. Tasks of this type are of great importance in applications and are widely used in various scientific fields (for example, in chemistry, medicine, econometrics, sociology, machine learning, etc.) [17][18][19][20][21][22]. Note that it is not always possible to construct nonlinear regression equations satisfying special requirements using the available universal statistical packages. In this regard, the development of methods for constructing nonlinear regression models that satisfy certain conditions and admit computer implementation is a relevant task.
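As a minimal illustration of the least squares method mentioned above, the following sketch (in Java, the language of the program described later in the paper) fits the single-predictor case y ≈ a0 + a1·x. The class and method names are our own, and the actual program handles many predictors; this is only a hedged example of the principle.

```java
// Sketch: ordinary least squares for one predictor, y ≈ a0 + a1*x.
// Illustrative only; names and scope are assumptions, not the paper's code.
public class LeastSquares {
    // Returns {a0, a1} minimizing the sum of squared residuals.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double a1 = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double a0 = (sy - a1 * sx) / n;
        return new double[] { a0, a1 };
    }
}
```

For exact data lying on the line y = 2x + 1, the method recovers a0 = 1 and a1 = 2.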
In this paper, a method is proposed that allows obtaining, on the basis of a preconstructed linear model, a series of various nonlinear models of a special kind which, according to certain characteristics, are better than the initial linear model.

Statement of the problem
Let us now turn to the formulation of the main problem considered in this paper. Let there be a sample consisting of N objects of arbitrary nature with known numerical values yi (i=1, …, N) of a certain characteristic y and known numerical values of some other quantitative parameters zj (j=1, …, t), t≥1. Thus, t+1 vectors of length N are given.
Let us extend the initial set of parameters zj (j=1, …, t) as follows: add to them new "inverse" parameters of the kind

wj = 1/zj, j=1, …, t. (2)

The numbers wj (j=1, …, t) are found for all N objects of the sample. If for some j the value of zj is zero for some objects of the sample, then we "shift" all the values of the parameter zj by the smallest integer a>0 such that zj+a≠0 for all objects of the sample. In this case, for such an index j we put

wj = 1/(zj + a). (3)

The main task of the present work is to develop an algorithm and an appropriate computer program that allows building a series of sequential correlation equations of the form

y = F(x1, …, x2t) (4)

using the initial data, where the functions F are some special polynomials F0, F1, F2, … of the variables x1, …, x2t (in fact, the functions F will depend only on the initial t variables x1, …, xt and will not generally be polynomials of these variables).
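The computation of the "inverse" parameters, including the shift by the smallest suitable integer a when zero values occur, can be sketched as follows (class and method names are illustrative assumptions, not the paper's code):

```java
// Sketch of the "inverse parameters" construction described above.
public class InverseParams {
    // Returns w[i] = 1/z[i]. If any z[i] == 0, all values are first
    // shifted by the smallest integer a > 0 with z[i] + a != 0 for all i.
    public static double[] inverse(double[] z) {
        int a = 0;
        boolean hasForbidden = true;
        while (hasForbidden) {
            hasForbidden = false;
            for (double v : z) {
                if (v + a == 0) { hasForbidden = true; a++; break; }
            }
        }
        double[] w = new double[z.length];
        for (int i = 0; i < z.length; i++) w[i] = 1.0 / (z[i] + a);
        return w;
    }
}
```

For a column containing a zero, e.g. z = (0, 1, 4), the shift a = 1 is applied and the inverse parameters become (1, 1/2, 1/5); when no value is zero, a stays 0 and wj = 1/zj as in (2).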
Next, we point out a number of conditions that the above procedure, as well as the corresponding polynomials F, must satisfy.

1) The process begins with the construction of a linear equation (with the linear function F0) containing a given number k (1≤k≤N) of the best parameters from the set x1, …, x2t, selected, for example, by the method of stepwise linear regression. The desired number k is set by the user.
2) Each subsequent equation of the series, with polynomial Fi+1, is constructed on the basis of the previous equation, with polynomial Fi, i≥0, according to certain rules.
3) Let us clarify the structure of the resulting equations. Obviously, in the general case an arbitrary polynomial F of degree M≥1 in the variables x1, …, x2t can be written as a linear combination of terms of the form

x1^n1 · x2^n2 · … · x2t^n2t, (5)

where the coefficients of the combination are some constants (the coefficients of the polynomial) and the non-negative integers n1, …, n2t are not all equal to zero simultaneously. Thus, such a polynomial can be considered as a linear function with respect to expressions of the form (5). We require that the polynomial F(x1, …, x2t) included in equation (4) contain a given number k of terms of the form (5), 1≤k≤N. The user sets the number k for each equation of the series. The degree M of the polynomial F can be arbitrary.
Thus, each equation of the series will be a nonlinear equation of the form (4) with respect to the initial parameters x1, …, xt, but it will be linear with respect to expressions of the form (5). Possibly the equation will not contain some parameters from the set x1, …, x2t or some terms of the form (5).

4) At the first stage of the work, the user should be able to select a part of the parameters from the initial set x1, …, x2t for all further constructions (this helps to extend the set of equations obtained).
The procedure described above allows obtaining a large number of different equations (models). In this regard, the question arises of selecting a small number of the best (by some criteria) variants. To estimate the quality of the obtained models of the type (4) and then select suitable variants from them, a number of their statistical characteristics can be calculated (see, for example, [19,23]). In particular, the following characteristics can be found: 1) R and Sd, the correlation coefficient and standard deviation for the linear correlation between the given and calculated values of y for all objects of the initial sample; 2) δmax (or δav), the maximum (or average) relative error (in %) in calculating the value of y according to the resulting equation for all objects of the initial sample; 3) Δmax (or Δav), the maximum (or average) absolute error in calculating the value of y according to the resulting equation for all objects of the initial sample.
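Two of the characteristics listed above can be computed directly from the given and calculated values of y, as the following hedged sketch shows (names are our own; the standard deviation and Fisher criterion are computed analogously from the residuals):

```java
// Sketch: quality characteristics of a model from the given values y
// and the values yCalc calculated by the regression equation.
public class ModelStats {
    // Pearson correlation coefficient R between given and calculated y.
    public static double correlation(double[] y, double[] yCalc) {
        int n = y.length;
        double my = 0, mc = 0;
        for (int i = 0; i < n; i++) { my += y[i]; mc += yCalc[i]; }
        my /= n; mc /= n;
        double num = 0, dy = 0, dc = 0;
        for (int i = 0; i < n; i++) {
            num += (y[i] - my) * (yCalc[i] - mc);
            dy += (y[i] - my) * (y[i] - my);
            dc += (yCalc[i] - mc) * (yCalc[i] - mc);
        }
        return num / Math.sqrt(dy * dc);
    }

    // Maximum relative error over all objects, in percent.
    public static double maxRelErr(double[] y, double[] yCalc) {
        double m = 0;
        for (int i = 0; i < y.length; i++)
            m = Math.max(m, Math.abs(y[i] - yCalc[i]) / Math.abs(y[i]) * 100.0);
        return m;
    }
}
```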
The specified requirements for the accuracy of the model are formulated as conditions on these characteristics (for example, one can require that R>0.95 or δmax<5%, etc.), depending on the specifics of the considered application task. One can also evaluate the quality of the model using the cross-validation procedure [24]. According to this procedure, one object is excluded from the initial set of objects, then a model is built for the remaining objects on the basis of previously selected parameters. Next, the value of y is calculated for the excluded object. This procedure is repeated for all objects of the initial set. Then a correlation is constructed between the given and calculated values of y for all N objects. For this correlation, the standard statistical characteristics are determined (correlation coefficient Rcv, standard deviation Sdcv, etc.).
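The leave-one-out loop just described can be sketched for the simplest single-predictor model y ≈ a0 + a1·x (a hedged illustration under our own names; the program applies the same scheme to its multiparameter models):

```java
// Sketch of leave-one-out cross-validation for the model y ≈ a0 + a1*x:
// each object is excluded in turn, the model is refit on the rest, and
// y is predicted for the excluded object.
public class CrossValidation {
    public static double[] looPredictions(double[] x, double[] y) {
        int n = x.length;
        double[] pred = new double[n];
        for (int skip = 0; skip < n; skip++) {
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                if (i == skip) continue;   // exclude one object
                sx += x[i]; sy += y[i];
                sxx += x[i] * x[i]; sxy += x[i] * y[i];
            }
            int m = n - 1;
            double a1 = (m * sxy - sx * sy) / (m * sxx - sx * sx);
            double a0 = (sy - a1 * sx) / m;
            pred[skip] = a0 + a1 * x[skip];  // predict the excluded object
        }
        return pred;
    }
}
```

The correlation between y and these predictions then yields Rcv, Sdcv, etc. exactly as for the ordinary fit.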
If there are several models of a given accuracy, preference is given to those models that in their linear form contain the smallest number of parameters k (1≤k≤N). This is due to the fact that the fewer parameters a linear regression model contains compared to the number of objects in the initial sample, the wider its scope of applicability (see, for example, [23]).
In the present paper, an algorithm for generating a series of nonlinear regression models of a special type is suggested, and a brief description of its possible program implementation is given. The algorithm is iterative in nature; the procedure for constructing models is managed in a certain way by the user. In addition, the paper provides examples illustrating the application of the algorithm to a number of mathematical modeling problems in the field of chemistry. The analysis of the obtained results shows the expediency of (a) introducing "inverse" parameters into the above procedure and (b) constructing nonlinear models.

Description of the algorithm
Input data for solving the problem: a vector of length N with coordinates yi (i=1, …, N), interpreted as the set of values of some characteristic y for N objects, as well as t vectors zj (j=1, …, t) of length N, interpreted as the sets of values of some parameters of the same objects. Sometimes the variable y is called the "response", and the variables zj (j=1, …, t) are called "predictors".
Step 1. Using the given parameters zj, we determine the new parameters wj (j=1, …, t) by formula (2) and find their values for all N objects of the sample. If for some index j the value zj=0 for some objects of the sample, then we "shift" all the values of the parameter zj by the smallest integer a>0 such that zj+a≠0 for all objects of the sample; in this case, we define wj by formula (3).

Step 2. Let us call the set of parameters x1, x2, …, x2t the initial set. We take some number k (1≤k≤2t). From the initial set of parameters, by the method of stepwise linear regression, we select the k parameters xi1, …, xik that give the best linear correlation for the characteristic y (i.e. with the highest correlation coefficient) and construct this correlation. For it, we find the correlation coefficient R, the standard deviation Sd, the maximum absolute error Δmax (together with the object on which it is attained), the average relative error δav (in %), and the value of the Fisher criterion F.
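A greatly simplified stand-in for the stepwise selection of Step 2 can be sketched as follows: at each step the column that correlates most strongly with the current residual of y is chosen, and its simple-regression contribution is removed from the residual. This greedy forward scheme is only an illustration under our own names, not the selection procedure actually implemented in the program:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: greedy forward selection of k parameters by residual correlation.
// A simplification of stepwise linear regression, for illustration only.
public class StepwiseSelect {
    public static List<Integer> select(double[][] cols, double[] y, int k) {
        int n = y.length;
        double my = 0;
        for (double v : y) my += v;
        my /= n;
        double[] r = new double[n];
        for (int i = 0; i < n; i++) r[i] = y[i] - my;  // centered residual
        List<Integer> chosen = new ArrayList<>();
        for (int step = 0; step < k; step++) {
            int best = -1;
            double bestAbs = 0;
            for (int j = 0; j < cols.length; j++) {
                if (chosen.contains(j)) continue;
                double c = corr(cols[j], r);
                if (!Double.isNaN(c) && Math.abs(c) > bestAbs) {
                    bestAbs = Math.abs(c); best = j;
                }
            }
            if (best < 0) break;
            chosen.add(best);
            // subtract the simple regression of the residual on the chosen column
            double[] x = cols[best];
            double mx = 0;
            for (double v : x) mx += v;
            mx /= n;
            double num = 0, den = 0;
            for (int i = 0; i < n; i++) {
                num += (x[i] - mx) * r[i];
                den += (x[i] - mx) * (x[i] - mx);
            }
            double b = num / den;
            for (int i = 0; i < n; i++) r[i] -= b * (x[i] - mx);
        }
        return chosen;
    }

    // Correlation of a column with the (already centered) residual.
    private static double corr(double[] x, double[] r) {
        int n = x.length;
        double mx = 0;
        for (double v : x) mx += v;
        mx /= n;
        double num = 0, dx = 0, dr = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - mx) * r[i];
            dx += (x[i] - mx) * (x[i] - mx);
            dr += r[i] * r[i];
        }
        return num / Math.sqrt(dx * dr);
    }
}
```

If one of the columns coincides with y, it is selected first, since its correlation with the initial residual equals 1.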
To obtain more detailed information about the accuracy of the constructed model, for each object of the given sample the calculated value of the characteristic y, the difference between the given and calculated values of y, as well as the corresponding relative error (in %), can be found. In addition, a cross-validation procedure can be carried out for the constructed model, and for the corresponding correlation obtained in this procedure its characteristics, such as Rcv, Sdcv, etc., are found.
The parameters xi1, …, xik included in the resulting correlation are called basic.
If the quality of the resulting model satisfies the user, then the work of the algorithm ends. If a more precise model needs to be built, then we go to Step 3.
Step 3. We form a new set of parameters consisting of the basic parameters (see Step 2), their squares and all possible pairwise products. The total number of parameters in the new set is t1 = 0.5k² + 1.5k.
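The expansion of Step 3 can be sketched directly (names are illustrative); note that the count of k basic parameters, k squares and k(k−1)/2 pairwise products indeed gives t1 = 0.5k² + 1.5k:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Step 3: from k basic parameter columns, form the new set
// containing the parameters themselves, their squares and all pairwise
// products. Resulting size: k + k + k(k-1)/2 = 0.5k^2 + 1.5k.
public class ExpandParams {
    public static List<double[]> expand(List<double[]> basic) {
        int n = basic.get(0).length;   // number of objects in the sample
        int k = basic.size();          // number of basic parameters
        List<double[]> out = new ArrayList<>(basic);
        for (int i = 0; i < k; i++) {
            for (int j = i; j < k; j++) {   // j == i gives the squares
                double[] p = new double[n];
                for (int s = 0; s < n; s++)
                    p[s] = basic.get(i)[s] * basic.get(j)[s];
                out.add(p);
            }
        }
        return out;
    }
}
```

For k = 3 basic parameters the expanded set contains 0.5·9 + 1.5·3 = 9 parameters.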
Step 4. Go to Step 2 and perform all the actions specified there, taking as the initial set the new set of parameters formed in Step 3; the number t1 now plays the role of the number 2t.

Remarks
1) The resulting final model significantly depends on the sequence of numbers k selected in Step 2 (when Step 2 is repeated multiple times). In this regard, in order to obtain a better result, it is advisable to vary the chain of numbers k. From the various final models, one can then choose one or several of the best.

2) Step 2 of the algorithm can be modified as follows: from the initial set of all 2t parameters x1, x2, …, x2t, some part is initially selected (the numbers of the selected parameters are indicated), and then models are built only on the basis of this part of the parameters.

Description of the computer program
Statistical processing of sufficiently large amounts of data is impossible without the use of an appropriate computer program; therefore, this paper also gives a brief description of a possible computer implementation of the above algorithm. The purpose is to demonstrate such features of the corresponding computer program as: a) simple and convenient data entry; b) the user's ability to manage the process of building models; c) a variety of information about the models obtained, which allows selecting suitable variants.
The algorithm is implemented as a computer program written in Java. The corresponding software package contains, in particular, the files "input", "start", "output", "model-output", and "validation-output".

Data input
The initial data are entered in an Excel file named "input". The values yi (i=1, …, N) are entered in column A, and the values of the parameters zj=xj (j=1, …, t) are entered sequentially in the other columns (B, C, D, …). For convenience, the first row of columns B, C, D, … contains the numbers of the entered parameters: 1, 2, …, t. Figure 1 shows an example of an input file with the entered data (N=19 is the number of considered objects, t=6 is the number of initial parameters).

Launching the program
The program is started using the file named "start". Fig. 2 shows an example of the dialog window at program startup (for the data in Fig. 1).
Next, the program automatically calculates the values of the "inverse" parameters for xj, j=1, …, t (possibly with a certain shift of some of them); these parameters are further denoted by xj, j=t+1, …, 2t. Then the user enters the numbers of the parameters from the list xj (j=1, …, 2t) that he will use further to build models (for example, 1 4 6 9 11 from the set 1, …, 12), or indicates that he will use all the entered parameters (see Fig. 2). Next, the user enters an integer k (0<k≤N), the desired number of parameters in the linear regression equation constructed at the first stage of the model generation process (see Fig. 2; here k=3). After all these data are entered, the corresponding linear regression equation appears on the screen, together with a number of its statistical characteristics (see Fig. 2). Here: N is the total number of objects; p is the number of parameters included in the resulting correlation (it is possible that p<k); max.residuals is the maximum of the differences between the calculated and given values of the characteristic y; MRE is the average relative error (in %); Sd is the standard deviation; R is the correlation coefficient; F is the value of the Fisher criterion.
Next, the user selects one of the options (see Fig. 3): 1) save the result; 2) perform the cross-validation procedure; 3) save the current model; 4) continue the program.
Suppose that option 4) is selected. Then a nonlinear correlation will be constructed with respect to those parameters that were included in the linear correlation obtained at the previous step. First, however, it is necessary to enter an integer k (0<k≤N), the desired number of parameters in the linear regression corresponding to the desired nonlinear regression. After k is entered, the program outputs the equation of the desired regression, as well as its statistical characteristics (see Fig. 3; here k=2, and the nonlinear equation is based on the linear equation shown in Fig. 2). Further, the process of building models continues: each following equation is built on the basis of the previous one, and each time the user enters some number k. If the model obtained at some stage of the program's work satisfies the user, then the model generation process ends.

Output data
The results of the program's work are presented in the following forms.
1. The equations of all obtained correlations, together with a set of their statistical characteristics, are displayed in a dialog window (as, for example, in Fig. 2 and Fig. 3); this information can be copied and transferred to a text file if necessary.
2. At the request of the user, for any intermediate model and for each object of the sample, the given and calculated values of the characteristic y, the difference between them, as well as the relative calculation error (in %), can be written in the "output" file (see the example in Fig. 4, columns A, B, C, D, respectively).
3. At the request of the user, a cross-validation procedure can be carried out for an intermediate model, with its results written in the "validation-output" file.
4. The "model-output" file stores the equation of the final model, which allows, if necessary, calculating the characteristic y for a new object not included in the initial set of objects.
Note that the information about the obtained models, in the form of their statistical characteristics or contained in the "output" and "validation-output" files, allows one to judge the accuracy of the models on the initial data and their stability, to identify "drop-out" objects, and to correct the strategy of searching for suitable models.

Examples
We give a number of examples from the area of mathematical modeling in the field of chemistry, illustrating the work of the proposed algorithm and the corresponding program, and analyze them. Using these examples, we show the expediency of using "inverse" parameters and nonlinear correlation equations in such problems.
Example 1. Consider the sample of N=19 organic compounds whose data are entered in the input file shown in Fig. 1; the values of their boiling points are given in column A. The structural formulas of these compounds can be represented as simple graphs whose vertices correspond to the carbon atoms in the molecules and whose edges correspond to the chemical bonds between them. As quantitative characteristics of the molecules, we use some invariants x1–x6 of the corresponding molecular graphs. Here x1=n and x2=T3, where n is the number of vertices of the graph and T3 is the number of pairs of vertices of degree 1 separated by three edges. The parameters x3–x6 are defined in certain ways in terms of the eigenvalues of the above graphs (for example, x6 is equal to the multiplicity of the zero eigenvalue of the graph). The values of the parameters x1–x6 are given in Fig. 1. For further modelling, the "inverse" parameters x7–x12 are also calculated.
Using these data, we construct a number of correlations between the property and the structural characteristics of these compounds and find their statistical characteristics. Let us require, for example, that for the initial sample δmax≤5% (δmax is the maximum relative error in calculating the property according to the resulting equation). We introduce the same restriction on the similar characteristic δmax,cv obtained in the cross-validation procedure for the obtained model. We look for models that satisfy the above constraints and at the same time contain the smallest number of parameters (in their linear form).
Note that the final model depends on the following information entered by the user in the process of building the model: 1) the parameters selected initially from the set x1–x12; 2) the chain of values of the number k. These data are enough for a run of the program to fully reproduce the final equation and find its statistical characteristics. Therefore, in the description of the models, for brevity, we use only this information.
Table 1 describes 7 of the obtained models satisfying the specified constraints and provides a number of their characteristics. The table gives: 1) the parameters selected from the set x1–x12 for their construction; 2) the chains of values of the number k that led to the final models; 3) the statistical characteristics R, Sd, MRE, δmax.
Table 2 shows the results of the cross-validation procedure for these models.
Let us analyze the results obtained. As can be seen from these tables, all the constructed models, differing in their parameters, satisfy the conditions δmax≤5% and δmax,cv≤5%, contain exactly 2 parameters (in linear form) and are nonlinear (except for model 1). At the same time, some models also include "inverse" parameters, so their use in the above procedure is justified. Among the obtained models there are nonlinear ones (for example, model 4) that are more accurate than the linear one (model 1), so the search for nonlinear correlations in such problems seems appropriate. In addition, it can be noted that the most significant structural parameter in this case is x1 (or x7=1/x1 in the absence of x1 in the set of selected parameters). By varying the parameter sets from x1–x12 and the chain of values of k, other correlation equations can be obtained. Setting less stringent restrictions on the characteristics of the models allows expanding the set of suitable equations obtained.
Example 2. The structural formulas of the chemical compounds considered in this example can be represented as graphs with single edges that are six-membered cycles whose vertices are labeled by the symbols H, Cl and NH2 (each cycle has exactly one vertex labeled NH2). As the parameters x1, x2, … we use the numbers of certain subgraphs in the corresponding graphs. Consider the following subgraphs: isolated vertices labeled Cl; chains of the form H-H-H; chains of the form H-Cl-H; pairs of isolated edges of the form Cl-Cl / Cl-Cl; pairs of isolated edges of the form H-Cl / Cl-Cl; chains of the form Cl-Cl-Cl-H; chains of the form Cl-Cl-Cl-Cl. Let x1, x2, x3, x4, x5, x6, x7 be the numbers of these fragments, respectively.
Using these data, let us construct a number of correlation equations of the structure-property relationship, find their statistical characteristics and analyze the results obtained.
Let us consider two ways of selecting parameters for building models: 1) from the initial set of parameters x1–x7; 2) from the complete set of parameters x1–x14, containing the inverse parameters for x1–x7 (note that to calculate them, all x1–x7 were "shifted" by 1).
For each variant, we first construct a linear correlation with k=7 parameters (Step 2 of the algorithm). Then, on its basis, we construct nonlinear (quadratic) correlations at k=7 and at k=6 (Steps 3 and 4). We obtain models 1, 2, 3, 4, 5, 6, respectively. Table 3 shows, for each model, the parameters included in it, as well as its statistical characteristics R, Sd, MRE. Let us carry out a comparative analysis of the results presented in Table 3.

1) Comparing linear models 1 and 4 according to their statistical characteristics, we conclude that the use of inverse parameters in this procedure allows us to obtain a more precise correlation. At the same time, some inverse parameters (x8 and x14) were included in model 4. Thus, the use of inverse parameters allows improving the results of modelling.
2) Comparing linear model 1 and nonlinear model 2 (containing exactly the same number of parameters, k=7, in its linear form) by their statistical characteristics, we conclude that model 2 is more precise than model 1. Similarly, comparing models 4 and 5, we come to the same conclusion. Moreover, it is possible to obtain models 3 and 6, more precise than 1 and 4, using only k=6 parameters. Thus, the proposed procedure for constructing nonlinear correlation equations makes it possible to improve the results of modelling obtained in the standard way within the framework of linear regression analysis.
We also give, as an example, the equation corresponding to model 6 from Table 3 (without specifying the values of the numerical coefficients a0–a6, and taking into account that x8 and x14 are the inverse parameters for x1 and x7, respectively):

Conclusions
In the present paper, an algorithm for generating a series of nonlinear multiparameter regression models of a special kind is proposed, and the corresponding software package implemented in Java is briefly described. The algorithm for constructing models is iterative in nature: the models are built sequentially, and each subsequent one is based on the previous one. The procedure for building models is managed in a certain way by the user. For the models obtained in the course of the program's work in the form of equations, a number of their statistical characteristics are determined. This allows evaluating their quality according to various criteria and then choosing the best models. It is also possible to carry out the cross-validation procedure for the constructed models, as well as to calculate on their basis the characteristic y for new objects not included in the initial set of objects.
The proposed procedure, in fact, allows us to pass from an initial linear regression model obtained at the first stage of research to another linear model of better quality, without increasing the number of parameters included in it and without involving additional predictors for its construction.
In addition, the paper gives some illustrative examples of constructing models of the structure-property relationship of organic compounds using the developed program. A comparative analysis of the obtained models is carried out, showing the expediency of introducing inverse parameters into the above procedure, as well as of constructing nonlinear models.

Fig. 1. An example of the input file "input" with the entered data.

Fig. 2. An example of the dialog window when the program starts and when the number k is entered for the first time.

Fig. 3. An example of the dialog window at the second entry of the number k.

Fig. 4. An example of output data in the "output" file.

Table 1. Basic information about some models of the structure-property relationship obtained in Example 1.

Table 2. Results of the cross-validation procedure for the models from Table 1.

Table 3. Basic information about some models of the structure-property relationship in Example 2.