Characteristic Selection and Prediction of Octane Number Loss in Gasoline Refinement Process

In the refining process of gasoline, accurate prediction of octane number loss supports production management and helps ensure the octane content of the product, so the related research has both theoretical significance and application value. Aiming at the characteristics of octane number loss data, namely few samples, high dimensionality and nonlinearity, this paper uses the maximum information coefficient, recursive feature elimination and random forest regression algorithms to select the main characteristics, and establishes an octane number loss prediction model based on the least squares support vector machine. Compared with the support vector machine, BP neural network and ridge regression algorithms, the experimental results show that both ridge regression and the least squares support vector machine achieve high prediction accuracy, with the least squares support vector machine performing best.


Introduction
Octane number (RON) is a numerical measure of the anti-knock combustion ability of gasoline in an automobile cylinder, and it is one of the most important indexes of gasoline quality. The higher the value, the better the anti-knock performance. An accurate prediction of RON loss can help production managers understand the production status and adjust the production plan in time. RON loss data have the characteristics of few samples, high dimensionality and nonlinearity. Generally speaking, not all characteristics of such data play a decisive role in the target characteristic. Therefore, it is inefficient and inaccurate to predict the target characteristic directly [1], and it is necessary to select the main characteristics.
In order to improve the prediction efficiency and accuracy of target characteristics, an NOx emission prediction model based on a deep recurrent neural network is given in reference [2]. This model performs well in both fitting effect and prediction accuracy, and can be effectively applied to the prediction of NOx concentration at the outlet of a flue gas denitration system. A reaction kinetics model of the MIP process based on structure-oriented lumping is proposed to predict and optimize the effects of outlet temperature and catalyst-to-oil mass ratio on the reaction process [3]. In reference [4], a support vector machine is used to predict obstructive sleep apnea in large-scale clinical samples in China, which provides a simple and accurate method for early identification of OSA patients.
In this paper, we use the maximum information coefficient, recursive characteristic elimination, and random forest regression [5] algorithm to select the main characteristics that affect RON loss, and use the least squares support vector machine model to predict and evaluate RON loss.

Related work
The main work of this paper includes: selecting the main characteristics of RON loss; establishing the prediction model of RON loss based on least squares support vector machine; testing the model. Figure 1 shows the process framework of characteristic selection and prediction of RON loss.

Characteristic selection
The so-called characteristic selection is to select meaningful characteristics, establish relevant data sets and input them into machine learning model for training. It is usually considered from two aspects. One is whether the characteristic is divergent, and the other is whether the characteristic is related to the target variable. The characteristic with high correlation should be given priority as the characteristic of the target variable [6].
According to the form of characteristic selection, selection methods can be divided into three kinds: the filter method, which scores each characteristic according to divergence or correlation and sets a threshold, or a number of characteristics to keep, to make the selection; the wrapper method, which selects several characteristics, or excludes some, in each round according to an objective function (usually a prediction effect score); and the embedded method, which trains a machine learning model to obtain a weight coefficient for each characteristic and selects characteristics in descending order of the coefficients [7].
Characteristic selection is the process of selecting the best characteristic for the target variable from the original data. That is, for given characteristics, we select a subset of the best characteristics to improve the performance of machine learning in predicting the target variable. In this paper, three methods, maximum information coefficient (MIC), recursive characteristic elimination (RFE) and random forest (RF), are used to select the main characteristics that affect RON loss.

Maximum information coefficient (MIC)
The maximum information coefficient (MIC) is one of the filter methods. It measures the degree of correlation between two variables, whether the relationship is linear or nonlinear, and is often used in feature selection for machine learning. By its nature, MIC possesses universality, fairness and symmetry, and can assist the Pearson correlation coefficient in feature selection [8]. The concept of mutual information is used to calculate the maximum information coefficient. Mutual information is defined as shown in (1):

I[X;Y] = \sum_{i,j} p(x_i, y_j) \log_2 \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \quad (1)

where p(x_i, y_j) is the joint probability of variables X and Y. Estimating the joint probability directly is quite troublesome. To solve this problem, MIC partitions the data into an a-by-b grid and normalizes the mutual information, as shown in (2):

MIC[X;Y] = \max_{ab < B} \frac{I[X;Y]}{\log_2 \min(a, b)} \quad (2)

where B is a parameter to be selected; in this paper it is set to the 0.6 power of the amount of data.
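As an illustration, the grid search of Eq. (2) can be sketched with equal-width bins (the full MIC algorithm also optimizes the grid partition boundaries, which this simplified sketch does not; function names are illustrative):

```python
import numpy as np

def grid_mi(x, y, a, b):
    """Mutual information of x and y under an a-by-b equal-width grid, Eq. (1)."""
    pxy, _, _ = np.histogram2d(x, y, bins=(a, b))
    pxy /= pxy.sum()                              # empirical joint probabilities
    px = pxy.sum(axis=1, keepdims=True)           # marginal of x
    py = pxy.sum(axis=0, keepdims=True)           # marginal of y
    nz = pxy > 0                                  # skip empty cells (0 log 0 = 0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

def mic(x, y, alpha=0.6):
    """Maximize normalized mutual information over grids with a*b <= n**alpha, Eq. (2)."""
    B = int(len(x) ** alpha)
    best = 0.0
    for a in range(2, B + 1):
        for b in range(2, B // a + 1):            # enforce a * b <= B
            best = max(best, grid_mi(x, y, a, b) / np.log2(min(a, b)))
    return best
```

A perfectly dependent pair (e.g. `mic(x, x)`) scores 1, while an independent pair scores close to 0, which is the fairness property mentioned above.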

Recursive feature elimination (RFE)
Recursive feature elimination (RFE) belongs to the wrapper methods. RFE does not only look at the direct association between X and Y; it also judges whether a feature is suitable from the final performance of the model after the feature is added. Figure 2 shows the flow chart of the RFE backward-search algorithm. Firstly, all features are input into the model, and the best features are selected. Then the selection is repeated among the remaining features until the last feature has been processed, and the best feature set is obtained [9].

Fig.2 The flow chart of the recursive feature elimination backward search algorithm
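The backward search described above can be sketched with scikit-learn's `RFE` class; the synthetic data and parameter values below are illustrative only, not the competition data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 325 samples, 30 features, 5 of them informative.
X, y = make_regression(n_samples=325, n_features=30, n_informative=5,
                       noise=0.1, random_state=0)

# Drop one feature per round (step=1) until 5 remain, ranking features
# by the magnitude of the linear model's coefficients.
selector = RFE(estimator=LinearRegression(), n_features_to_select=5, step=1)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)   # indices of the kept features
```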

Random forest (RF)
The basic idea of random forest (RF) is to combine the regression results of multiple classification and regression trees (CART) to obtain the final result [10]. Firstly, the random forest algorithm generates n sub-datasets by bootstrap resampling, then randomly selects K features from each sub-dataset, builds a CART, and repeats the above steps m times. Finally, the results of all decision trees are aggregated, by voting for classification or averaging for regression, to obtain the final result.

Establishment of least squares support vector machine (LS-SVR) model
Support vector machine, also known as support vector network, is a new machine learning method developed on the basis of statistical learning theory, which has the advantages of complete theory, strong adaptability, global optimization, short training time and good generalization performance [11]. Least squares support vector machine (LS-SVR) is a kind of support vector machine, which is obtained by transforming inequality constraints in standard support vector machine algorithm into equality constraints.
In LS-SVR, the inequality constraints of the support vector machine optimization problem are replaced by equality constraints. An error variable e_i is introduced for each sample, and the L2 regularization term of the error variables is added to the objective function. The optimization problem of least squares support vector regression is shown in (3):

\min_{w,b,e}\; \frac{1}{2}\|w\|^2 + \frac{\gamma}{2}\sum_{i=1}^{n} e_i^2, \quad \text{s.t. } y_i = w^{T}\varphi(x_i) + b + e_i, \; i = 1, \dots, n \quad (3)

The squared error term replaces the insensitive loss function of the standard support vector machine, and the inequality constraints with slack variables are replaced by equality constraints with error variables. The dual problem is obtained by the Lagrange multiplier method; the Lagrangian is introduced as shown in (4):

L(w, b, e, \alpha) = \frac{1}{2}\|w\|^2 + \frac{\gamma}{2}\sum_{i=1}^{n} e_i^2 - \sum_{i=1}^{n}\alpha_i\left(w^{T}\varphi(x_i) + b + e_i - y_i\right) \quad (4)
In this paper, the least squares loss function is selected as the optimization index; the resulting LS-SVR model function is shown in (5):

f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) + b \quad (5)

where K(x, x_i) = \varphi(x)^{T}\varphi(x_i) is the kernel function, and \alpha_i and b are obtained from the solution of the dual problem.
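Eliminating w and e from the KKT conditions of (3)-(4) reduces the dual to a single linear system in alpha and b, which a short numpy sketch can solve directly (the RBF kernel and parameter values here are illustrative assumptions, not the paper's tuned settings):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvr_fit(X, y, gamma=100.0, sigma=1.0):
    """Solve the LS-SVR KKT system:
       [ 0   1^T         ] [ b     ]   [ 0 ]
       [ 1   K + I/gamma ] [ alpha ] = [ y ]"""
    n = len(y)
    A = np.empty((n + 1, n + 1))
    A[0, 0] = 0.0
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]          # b, alpha

def lssvr_predict(X_train, alpha, b, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b, as in Eq. (5)."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```

Solving one dense linear system instead of a quadratic program is exactly the computational advantage LS-SVR gains from replacing the inequality constraints with equalities.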

Model evaluation index
After training and testing the model, we need to evaluate it in order to judge its quality. The model is evaluated from five aspects: mean absolute error (MAE), mean absolute percentage error (MAPE), mean square error (MSE), root mean square error (RMSE) and the coefficient of determination (R2).
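The five indexes can be computed directly; a minimal numpy sketch (the function name is illustrative, and MAPE assumes no zero targets):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return MAE, MAPE, MSE, RMSE and R2 for a set of predictions."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true))   # undefined if any y_true is 0
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mae, mape, mse, rmse, r2
```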

Software platform
The experimental platform is a Ubuntu 64-bit operating system with Python 3.7, and the IDE is Spyder; the drawing platform is a Windows 10 64-bit operating system, and the drawing software is Visio 2013 and Adobe Illustrator (AI).

Data sources and data processing
The data in this paper come from Annex 1 and Annex 3 of question B of the 2020 China graduate mathematical modeling competition. The data samples were collected from an FCC gasoline refining unit. Annex 1 contains 325 data samples; each sample has 354 operating variables and 14 characteristics of raw material properties, product properties, spent adsorbent and regenerated adsorbent, 368 characteristics in total. Annex 3 contains the data of samples No. 285 and No. 313, each with only 40 measurements of the 354 operating variables.
First of all, the data of samples 285 and 313 in Annex 3 are preprocessed, and the processed values of the 354 operating variables are added to the corresponding sample numbers in Annex 1. Then the 354 operating variables of the 325 samples in Annex 1 are normalized. Because the scales of different characteristics differ greatly, the calculated coefficients would also differ greatly, which prevents the optimization process from quickly finding the optimal solution of the model. In this paper, the min-max normalization method is used to map the features that affect RON loss to [0, 1], as shown in (7):

x'_{ij} = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}, \quad i = 1, 2, \dots, m, \; j = 1, 2, \dots, n \quad (7)

where \min_i x_{ij} and \max_i x_{ij} are the minimum and maximum values of the j-th feature, m is the number of data samples, and n is the feature dimension of the data. The top 29 features extracted by each of the three algorithms are shown in Table 1. Through the three algorithms, the top five features of each method and a total of 4 common features are selected, 19 features in total. It is known from references [12,13] that sulfur content and alkane content also have a certain influence on octane number loss.
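The normalization of Eq. (7) can be applied column-wise in a few lines (a sketch; it assumes no constant column, which would make a denominator zero):

```python
import numpy as np

def min_max_normalize(X):
    """Map each feature column of X to [0, 1] as in Eq. (7)."""
    col_min = X.min(axis=0)          # per-feature minimum
    col_max = X.max(axis=0)          # per-feature maximum
    return (X - col_min) / (col_max - col_min)
```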

Experimental results of feature selection
Eight characteristics, including RON loss, are selected from the 14 characteristics of raw material properties, product properties, spent adsorbent and regenerated adsorbent, so 27 characteristics are finally selected, as shown in Table 2.

Experimental results of octane number loss prediction
Feature selection provides data support for the subsequent RON loss prediction. The selected features are standardized, and the 325 samples in Annex 1 are then divided into a training set of the first 265 samples and a test set of the last 60 samples.
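This chronological split can be written directly; the arrays below are random stand-ins for the real data, and fitting the standardization statistics on the training part only (to avoid leakage from the test set) is a common choice assumed here rather than stated in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(325, 27))          # stand-in: 325 samples, 27 selected features
y = rng.normal(size=325)                # stand-in for the RON loss values

X_train, X_test = X[:265], X[265:]      # first 265 samples for training
y_train, y_test = y[:265], y[265:]      # last 60 samples for testing

# Standardize using statistics of the training set only.
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu) / sd
X_test_s = (X_test - mu) / sd
```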
The BP neural network model is trained for 15000 iterations with a batch size of 16, as shown in Figure 5, where epochs represents the number of training passes and loss represents the loss value. It can be seen that the training and validation loss values approach 0 and tend to be stable.

Figure 7 shows the prediction of octane number loss by the support vector machine. It can be clearly seen that the actual and predicted values of the first 265 training samples basically fit, while the actual and predicted values of the last 60 test samples do not fit well.

Fig.7 The prediction of octane number loss by SVR

Figure 8 and Figure 9 show the RON loss prediction results of ridge regression and the least squares support vector machine, respectively. It can be seen that the actual and predicted values of the first 265 training samples are almost completely fitted, and the actual and predicted values of the last 60 test samples are also well fitted.

Evaluation index experimental results
In order to verify the quality of the model, it is validated from two aspects. Firstly, the least squares support vector machine with a linear kernel and with a radial basis function (RBF) kernel are compared in terms of mean absolute error (MAE), mean absolute percentage error (MAPE), mean square error (MSE), root mean square error (RMSE) and the coefficient of determination (R2); secondly, four different algorithms are evaluated with the same indexes.

Fig.8 The prediction of octane number loss by Ridge

Fig.9 The prediction of octane number loss by least squares support vector machine

Table 3 shows that the linear kernel function is better than the radial basis function kernel in predicting RON loss. Table 4 and Table 5 show that the least squares support vector machine is better than the other three models in predicting RON loss. Figure 10 shows that the R2 values of the training and test sets of ridge regression and the least squares support vector machine are larger, indicating that these two algorithms perform better, while the R2 score of the least squares support vector machine is 1% higher than that of ridge regression. From the statistics of the above model and kernel evaluation indexes, the least squares support vector machine with a linear kernel predicts RON loss better than with an RBF kernel; most notably, the R2 of the RBF kernel is only 14%. The accuracy of the LS-SVR model in predicting RON loss is higher than that of the other algorithms, and its R2 reaches 99%.

Conclusions
(1) The main characteristics of octane number loss were selected by the maximum information coefficient, recursive feature elimination and random forest regression algorithms.
(2) Octane number loss was predicted based on the least squares support vector machine.
(3) By comparing the evaluation indexes of LS-SVR with the support vector machine, BP neural network and ridge regression models, the results show that LS-SVR achieves the best experimental effect.