Research on Pump Inspection Cycle Early Warning Method Based on Big Data

Abstract: Existing indicator diagram diagnosis can only be used for ex post judgment and cannot give early warning, and the factors influencing the pump inspection period are nonlinear, multi-constrained and multivariable. This paper applies big data machine learning methods to the problem. First, relevant data are collected around the influencing factors of the pump inspection cycle, and an evaluation index of the pump inspection cycle is designed. Then, using feature engineering, the production parameters of oil wells in different pump inspection periods are calculated to form the analysis sample set of the pump inspection period. Finally, an early warning model of the pump inspection period is established using machine learning. The experimental results show that the early warning model established with the random forest algorithm can identify the pump inspection status of a single well with an accuracy of about 85%.


Introduction
During the operation of a pumping well, the production efficiency of the pump is reduced by factors such as paraffin deposition, sand production and heavy oil production, so downhole faults must be resolved by workover and pump inspection [1]. Pump inspection during workover can be either planned or forced; at present, forced inspection predominates. Although indicator diagram diagnosis can analyze the working condition of the oil well pump, it cannot give advance warning of pump inspection events such as rod breaks, pump leaks and tubing leaks. At the same time, improving the production condition of a pumping well requires optimizing its working parameters and taking maintenance measures, including pump inspection, parameter adjustment, wax removal and prevention, well washing, sand flushing and control, and scale removal and prevention. These measures alternate at regular or irregular intervals and jointly affect the well, so oil well production data are multi-event, nonlinear, multi-constraint and multivariable. Because the factors interact and their degrees of influence and relationships are complex, accurate early warning of the pump inspection cycle remains unsolved, which directly affects the optimization and implementation of production and operation management. Early warning of the pump inspection cycle in pumping wells has therefore become a key problem that oil fields urgently need to solve.
Related research on the pump inspection period falls into three types. The first extends the pump inspection period through technological measures [2][3][4][5][6], such as gas-proof measures for gas-affected wells, anti-eccentric-wear technology at the bottom of sucker rods, and anti-corrosion sucker rod couplings with bidirectional protection. The second optimizes the pumping unit system by simulation [7][8][9][10]. The third addresses a single technique or a single factor [11][12][13][14][15][16][17][18][19], such as the effect of vibration load on eccentric wear of rod and tubing, the effect of water cut and submergence on eccentric wear of rod and tubing, and the effect of swabbing parameter adjustment on the pump inspection rate of oil wells.
The pump inspection problems and techniques described in the literature focus mainly on traditional methods that address a single cause of pump inspection, and occasionally on comprehensive pump inspection strategies and treatment methods; little research is based on big data and artificial intelligence methods. With the development of data science, machine learning offers new ideas and methods for intelligent evaluation in petroleum exploration and development. Following a data-driven approach, this paper first collects relevant data around the influencing factors of the pump inspection cycle, establishes the evaluation index of the pump inspection cycle, and builds the pump inspection cycle data set. It then uses feature engineering to calculate the production parameters of oil wells in different pump inspection cycles, forms the sample set for pump inspection cycle analysis, and uses the random forest algorithm to establish the early warning model, so as to achieve early warning of single-well production status.

Establishment of evaluation index of pump inspection cycle
The main factors that affect pump inspection are fatigue strength, wear of tubing and rod, corrosion, wax and scale deposition, leakage, rod breakage, the working system of the oil well and so on. These factors are ultimately reflected in the data. In the era of big data, the dynamic analysis data of oil and water wells provide the basic data support for intelligent analysis and prediction of the pump inspection cycle. According to the influencing factors of the pump inspection cycle and the data available in the project, the evaluation index of the pump inspection cycle is designed. It mainly includes production data indexes and indicator diagram data indexes. Because the production data and indicator diagram data are time series and the amount of data is large, using the time series directly for machine learning modelling would increase the complexity of the model. In this paper, characteristic parameters such as the mean, median, range, variance, standard deviation, coefficient of variation, crest, trough, skewness and kurtosis are therefore extracted from each series.

Tab.1 Evaluation index of pump inspection period
Skewness reflects the asymmetry of the production index-time curve within a pump inspection period. It is the third standardized moment:
$$\mathrm{Skew}(X) = E\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right]$$
where μ is the mean and σ is the standard deviation; in the actual calculation, μ is replaced with its sample value.
Kurtosis reflects the sharpness of the production index-time curve and is a statistic describing how steep or flat the distribution of all values is. Kurtosis is defined as the fourth standardized moment:
$$\mathrm{Kurt}(X) = E\left[\left(\frac{X-\mu}{\sigma}\right)^{4}\right]$$
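The feature extraction described in this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `summarize_series` and the sample values are hypothetical, crest and trough are assumed to mean the maximum and minimum of the series, and population (1/N) moments are assumed for the variance, skewness and kurtosis.

```python
import math
import statistics


def summarize_series(values):
    """Summary statistics for one production index over one pump inspection period."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population (1/N) moments, matching the moment definitions of skewness and kurtosis.
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    features = {
        "mean": mean,
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "variance": variance,
        "std": std,
        "coefficient_of_variation": std / mean if mean != 0 else float("nan"),
        "crest": max(values),    # assumption: crest = maximum value
        "trough": min(values),   # assumption: trough = minimum value
    }
    if std > 0:
        z = [(v - mean) / std for v in values]
        features["skewness"] = sum(t ** 3 for t in z) / n  # third standardized moment
        features["kurtosis"] = sum(t ** 4 for t in z) / n  # fourth standardized moment
    else:
        # A constant series has no defined skewness or kurtosis.
        features["skewness"] = float("nan")
        features["kurtosis"] = float("nan")
    return features


# Hypothetical daily liquid production values for one period (illustrative only).
daily_liquid = [10.0, 12.0, 11.0, 13.0, 30.0, 12.0, 11.0]
feats = summarize_series(daily_liquid)
for name, value in feats.items():
    print(f"{name}: {value:.3f}")
```

Applying such a function to every time series index of a well turns each pump inspection period into one fixed-length feature vector, which is the form the later modelling steps require.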

Sample construction of pump inspection cycle
If the peak, trough, trend, cycle and other characteristics are extracted for every time series index, a single pump inspection cycle sample will have hundreds of features, and such high-dimensional data often contain irrelevant or redundant features. Therefore, before modeling, it is necessary to check the correlation among the feature parameters and delete the redundant ones. Correlation analysis quantifies the degree to which different factors vary together. The degree of correlation between sample factors is measured by the correlation coefficient, a value in [-1, 1]: the larger its absolute value, the stronger the correlation between the factors; a negative value indicates a negative correlation (the factors change in opposite directions), and a positive value indicates a positive correlation (the factors change in the same direction). In this paper, the Pearson correlation coefficient is used to calculate the correlation between the parameters. For two variables X and Y, it is calculated as
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X \sigma_Y}$$
and is estimated from the data as
$$r = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^{2}}}$$
where E is the mathematical expectation, Cov is the covariance, and N is the number of observed values of each variable. In this paper, parameters whose correlation with another parameter is greater than 0.8 are deleted as redundant.
The characteristic parameters of the pump inspection cycle data for the whole plant were calculated, and 776 samples were finally generated after missing value deletion, correlation analysis and redundant attribute deletion, as shown in the table below.
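A correlation-based filter of this kind might look like the sketch below. It is an illustration under stated assumptions, not the procedure used in the paper: the helper names `pearson` and `drop_correlated`, the feature names and the values are hypothetical, the 0.8 threshold is applied to the absolute value of the coefficient, and when two features are highly correlated the one seen first is kept.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; 0.0 is a convention of this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.8):
    """Greedily keep features whose |r| with every already kept feature is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns, one value per pump inspection sample.
features = {
    "oil_mean": [1.0, 2.0, 3.0, 4.0, 5.0],
    "liquid_mean": [2.1, 3.9, 6.2, 8.0, 9.9],    # almost proportional to oil_mean
    "water_cut_var": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_correlated(features)
print(selected)
```

Here `liquid_mean` is dropped because it is almost a linear function of `oil_mean`, while `water_cut_var` is retained. Which member of a correlated pair to keep is a design choice; keeping the first one encountered is only the simplest option.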

Early warning model of pump inspection cycle
Machine learning algorithms play an important role in big data analysis. They cover classification and prediction, clustering, association analysis and other tasks, and include hundreds of algorithms such as random forest, support vector machine and neural network. Because data distribution characteristics differ, the applicable algorithms differ among data sets. Choosing a suitable machine learning algorithm to obtain the best prediction accuracy is an iterative process of repeatedly selecting models and parameters and analyzing their effectiveness.

Parameter selection
There are many algorithms for parameter selection. Among the most common are the random forest feature importance measures, MDI (Mean Decrease Impurity) and MDA (Mean Decrease Accuracy).
(1) Mean Decrease Impurity (MDI) indicates the average reduction of the error contributed by each feature. At each split of each tree, the improvement in the split criterion is the importance measure attributed to the splitting variable, and it is accumulated over all the trees in the forest separately for each variable. A decision tree divides a node into two child nodes according to a rule, and each split selects the feature that minimizes the error. The error can be calculated using mean square error, Gini impurity, information gain, or another metric set as needed. In scikit-learn, the improvement from each split is weighted by the number of samples reaching the node, and the feature importances are normalized.
(2) Mean Decrease Accuracy (MDA) shuffles the values of each feature and measures the effect of the shuffling on the accuracy of the model. The method uses out-of-bag (OOB) data to calculate importance. For a given tree, the OOB data are the part of the training set that was not used to train that particular tree. The baseline error is calculated on the OOB data, and then the values of each feature are randomly shuffled in turn. For unimportant features, shuffling has little effect on the accuracy of the model, but for important features, shuffling reduces the accuracy.
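Both importance measures can be illustrated with scikit-learn, as in the hedged sketch below. The data are synthetic stand-ins generated with `make_classification`, not the pump inspection samples, and the hyperparameters are arbitrary. Note one deliberate substitution: scikit-learn's `permutation_importance` shuffles features on a data set supplied by the caller (here a held-out split) rather than on per-tree OOB data, so it is an MDA-style measure and not the exact OOB procedure described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 8 features, of which the first 3 are informative.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# MDI: impurity decrease accumulated over all splits and trees, normalized to sum to 1.
mdi = forest.feature_importances_

# MDA-style: mean drop in accuracy after shuffling one feature at a time on held-out data.
mda = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0, scoring="accuracy"
)

# List features from most to least important according to MDI.
ranking = sorted(range(X.shape[1]), key=lambda j: mdi[j], reverse=True)
for j in ranking:
    print(f"feature {j}: MDI={mdi[j]:.3f}  MDA={mda.importances_mean[j]:.3f}")
```

Comparing the two rankings is a useful check, because MDI is computed from the training data only and can favor features with many distinct values, whereas the permutation measure reflects the effect on predictive accuracy.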
After repeated calculations of parameter importance, 35 feature parameters were selected: production time_total cumulative volatility, daily oil production_total accumulative volatility, total stroke, back pressure_total accumulative volatility, daily oil production_mean value of cumulative volatility, daily liquid production_total accumulative volatility, production time_mean value of cumulative volatility, back pressure_total accumulative volatility, daily total jig frequency_variance, dynamic liquid level_kurtosis, downward current_variance, water cut_total accumulative volatility, production time_kurtosis, water cut_total accumulative volatility, daily liquid production_coefficient of variation, deflection of transfer load, C total effective accumulative volatility, pump depth_skewness, production time_skewness, water cut_variance, water cut_skewness, daily oil production_skewness, jig frequency_minimum, daily fluid production_variance, back pressure_interquartile range, daily oil production_mode, downward current_skewness, gas-oil ratio_skewness, pump efficiency_skewness, maximum load mode, dynamic fluid level_minimum value.

Data set partitioning
A machine learning sample set is generally divided into a training set and a test set. The former is used to train the parameters of the model, and the latter is used to test its generalization ability. In data mining, it is usually impossible to collect all the sample data, so sampling is needed. Common sampling methods include random sampling, stratified sampling, systematic sampling and cluster sampling. This project uses stratified sampling, which divides the population into several homogeneous layers and then applies random or systematic sampling within each layer. Stratified sampling combines scientific grouping with sampling: grouping reduces the effect of variability within each layer and ensures that the samples are sufficiently representative. According to how samples are allocated to the layers, stratified sampling can be divided into general stratified sampling and stratified proportional sampling. General stratified sampling determines the sample size of each layer according to its variability, drawing more samples from layers with large variability and fewer from layers with small variability. Stratified proportional sampling is usually used when the variability of the sample is not known in advance.
Using stratified sampling, the 776 samples are divided into a training set and a test set at a ratio of 8:2, giving 620 training samples and 156 test samples.
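A stratified 8:2 split of this size can be reproduced with scikit-learn's `train_test_split`, as sketched below. The feature matrix and labels are random placeholders, and the number of classes (three) and their proportions are assumptions made only for illustration; the paper does not state them. Stratifying on the label corresponds to stratified proportional sampling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 776

# Placeholder data (assumption): 35 features and 3 hypothetical status classes.
X = rng.normal(size=(n_samples, 35))
y = rng.choice([0, 1, 2], size=n_samples, p=[0.5, 0.3, 0.2])

# stratify=y keeps the class proportions (almost) identical in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(len(X_train), len(X_test))  # 620 156
for c in np.unique(y):
    print(c, round(float(np.mean(y_train == c)), 3), round(float(np.mean(y_test == c)), 3))
```

With `test_size=0.2`, scikit-learn rounds the test set size up (0.2 × 776 = 155.2, giving 156) and assigns the remaining 620 samples to the training set, which matches the counts reported above.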

Random forest modeling
Random forest is an ensemble learning algorithm that performs well on both regression and classification problems. It is based on the decision tree algorithm, which recursively partitions the sample space according to the data features in the training set, forming a series of IF-THEN rules organized in a tree structure. Using the idea of stochastic simulation, N randomized decision trees are constructed to form a "forest", and the final prediction is made by combining the results of the decision trees in the forest. The idea of the random forest algorithm is illustrated in the following figure.
The 35 parameters above were used as input features, and the random forest algorithm was used to establish the model. Through repeated iterations, the average accuracy of the model is about 0.85, as shown in the figure below.
The confusion matrix, also known as the error matrix, is a standard format for accuracy evaluation. It is an N-row, N-column matrix used to compare the classification results with the actual measured values. Each column of the confusion matrix represents a predicted category, and the column total is the number of instances predicted as that category; each row represents the true category, and the row total is the number of instances that actually belong to that category. The confusion matrix of the model is shown below.
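The modelling and evaluation steps can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not a reproduction of the reported result: the data come from `make_classification`, the three classes and all hyperparameters are assumptions, and the accuracy printed by this script has no relation to the 0.85 reported for the field data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 776 samples, 35 features, 3 hypothetical classes.
X, y = make_classification(
    n_samples=776, n_features=35, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Each tree is grown on a bootstrap sample with a random feature subset at every split;
# the forest predicts by aggregating the predictions of its trees.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Rows are true classes and columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

print(f"accuracy = {accuracy:.3f}")
print(cm)
```

The diagonal of `cm` counts correctly classified samples, so the accuracy equals the trace divided by the total, while the off-diagonal entries show which classes are confused with each other.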

Conclusion
This paper presents the application of big data technology to early warning of the pump inspection cycle, including the establishment of the evaluation index, sample construction and machine learning modeling. Because oilfield data form a complex system that spans a large range of time and space and suffers from problems of data quality, data integrity, noise and data set imbalance, the construction of machine learning data sets faces a huge challenge. Because the samples are limited and may not cover all pump inspection situations, the model may show high accuracy in testing but lower accuracy in application. This is expected to improve gradually as the model is applied.