Intellectualizing analysis and assessment of statistical data on field pipeline failures due to internal corrosion

The paper presents some results of a study aimed at creating an expert system for selecting the material design and the method of internal corrosion protection for field pipelines at the design stage. The author has performed an intellectual analysis of operating statistical data on field pipeline failures. The first part of the paper describes the initial sample and the exploratory analysis performed. The second and third parts describe the creation and assessment of a classifier based on the Random Forest algorithm. To assess the quality of the classifier, the author has calculated the share of correct answers of the algorithm (accuracy), precision and recall, as well as the F1-score. The author concludes that the values of the quality metrics are satisfactory and outlines areas for further research.


Introduction
Oil and gas companies use field pipelines with various design specifications operated under the influence of various internal and external factors during oil and gas production, primary treatment and transportation. These internal and external factors directly influence the emergence and development of internal and external corrosion processes in some sections of field pipelines.
This work is connected with the development of a software prototype of an expert decision-making support system (hereinafter referred to as the "System") for choosing the material design and the method of internal corrosion protection for field pipelines. The goal of this study is to conduct a sequential intellectual data analysis and to create an example of a predictive model that can be used as the main component of the System.

Data mining of operational statistics
The current work deals with statistical data on the failures and operating conditions of various field pipelines. The initial sample covers 641 fields in the Russian Federation and contains 109,989 examples of field pipeline failures (data lines). These failures occurred in 2000-2019 on pipelines commissioned in 1938-2018. The analysis is performed in Python.

Exploratory data analysis
Basic information on the causes and nature of failures in the original sample is presented in table 1. The current work considers failures due to internal corrosion, which account for most of the failures in the sample. In addition, the cause is unknown (NaN) for 14,356 failures. These failures are also included in the current study, as we can assume that most of them are due to internal corrosion. The predictive model is trained on failures with these causes, whatever the nature of the failure.
The result of the predictive model will be the mean time between failures caused by internal corrosion. The feature "Nature of failure" cannot be considered in the model, since it is unknown until the moment of failure. A preliminary assessment used to include or reject specific features is based on the following rules:
The feature is specified at the design stage and known before an actual failure occurs.
The feature characterizes the operating reliability of the inner wall of a test target (field pipeline section).
The feature characterizes the internal operating factors of the test target: hydrodynamic parameters, physical and chemical properties of pumped media.
For comparison, the author presents information on some features not involved in the analysis:
"Failure coordinate," "Manufacturing plant," and "Construction contractor," as these features are unknown at the design stage.
Organizational and economic data, as these features are not connected with operating reliability.
It should be noted that failures caused by factors not connected with field pipeline inner wall reliability are excluded from the initial table. At the same time, the study comprises the following features: "Type of outer coating," "Type of heat insulation coating," "Laying method," "Type of soil," and "Laying depth." These features influence the temperature of pumped media and, consequently, lead to the emergence and development of corrosion processes [1].
The preliminary analysis restores the values of features that can be determined definitely or with some assumptions, for example:
Water content (in percentage) and zero values of oil flow rate for water and gas pipelines meant to transport dry gas.
Zero values of gas flow rate for pipelines pumping gas-free oil or water.
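The restoration rules above can be sketched with pandas as follows. This is only an illustration under stated assumptions, not the author's code: the column names (`pipeline_type`, `water_content_pct`, `f_oil`, `f_gas`, `gas_free`), the category labels, and the water content values of 100% for water pipelines and 0% for dry-gas pipelines are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: column names and category labels are illustrative only.
df = pd.DataFrame({
    "pipeline_type": ["water", "dry_gas", "oil", "oil_gas"],
    "water_content_pct": [np.nan, np.nan, 35.0, np.nan],
    "f_oil": [np.nan, np.nan, 120.0, 80.0],
    "f_gas": [np.nan, 500.0, np.nan, 40.0],
    "gas_free": [True, False, True, False],
})

is_water = df["pipeline_type"] == "water"
is_dry_gas = df["pipeline_type"] == "dry_gas"

# Assumed rule: water pipelines carry 100% water, dry-gas pipelines carry 0%.
df.loc[is_water & df["water_content_pct"].isna(), "water_content_pct"] = 100.0
df.loc[is_dry_gas & df["water_content_pct"].isna(), "water_content_pct"] = 0.0

# Water pipelines and dry-gas pipelines transport no oil.
df.loc[(is_water | is_dry_gas) & df["f_oil"].isna(), "f_oil"] = 0.0

# Pipelines pumping gas-free oil or water have a zero gas flow rate.
df.loc[df["gas_free"] & df["f_gas"].isna(), "f_gas"] = 0.0

# Values that cannot be determined by a rule stay missing.
remaining_missing = int(df.isna().sum().sum())
```

Only values covered by an explicit rule are filled in; the water content of the mixed oil and gas section in the last row remains missing.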
Mean time to failure [2] for each unique field pipeline section is taken as a target value.
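The paper does not spell out how the target is computed, so the following is only one possible sketch. It assumes that the time to failure is counted from the commissioning year to the failure year and averaged per section; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical failure records; one row per failure.
failures = pd.DataFrame({
    "section_id": ["A", "A", "B", "B", "B"],
    "commission_year": [2000, 2000, 2010, 2010, 2010],
    "failure_year": [2006, 2010, 2013, 2014, 2018],
})

# Assumed definition: operating time from commissioning to each failure, in years.
failures["time_to_failure"] = failures["failure_year"] - failures["commission_year"]

# Target value: mean time to failure for each unique pipeline section.
mttf = failures.groupby("section_id")["time_to_failure"].mean()
```

For a classifier, the resulting values would then be grouped into classes (for example, intervals of years).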
The initial sample comprises more detailed information on the physical and chemical properties of pumped media: Ca2+ in water (mg/l), CO2 in water (mg/l), dissolved O2 (mg/l), H2S in water and oil (mg/l), Mg2+ in water (mg/l), suspended particles (mg/l), total dissolved solids (mg/l), etc. However, involving all these features reduces the sample size to 2,143 examples of failures if data lines with missing values are excluded. Therefore, practically all physical and chemical properties of pumped media are represented by the "Group of corrosion contour (GCC)" feature. This feature takes values from 1 to 4 and indicates the overall level of medium aggressiveness.
The importance of features (Fig. 1) is assessed using the Random Forest (RF) algorithm [3]. Categorical features are coded numerically using LabelEncoder [4]. To check the adequacy of this approach, importance is assessed for both standardized [4] and initial continuous features. The two resulting importance diagrams are practically identical (the deviations of the importance of each feature do not exceed 1%).
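A minimal sketch of such an importance check with scikit-learn is given below. It uses synthetic data, and the feature names, the target and the forest settings are assumptions made for illustration, not the author's configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the prepared sample (hypothetical feature names).
X = pd.DataFrame({
    "p_in": rng.normal(2.0, 0.5, n),
    "temperature": rng.normal(40.0, 10.0, n),
    "laying_method": rng.choice(["underground", "aboveground"], n),
})
# Synthetic target that depends only on the inlet pressure.
y = (X["p_in"] > 2.0).astype(int)

# Numerical coding of the categorical feature.
X["laying_method"] = LabelEncoder().fit_transform(X["laying_method"])


def importances(features):
    """Fit a random forest and return impurity-based feature importances."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(features, y)
    return pd.Series(model.feature_importances_, index=features.columns)


raw = importances(X)

# Repeat the assessment with standardized continuous features.
X_std = X.copy()
continuous = ["p_in", "temperature"]
X_std[continuous] = StandardScaler().fit_transform(X[continuous])
standardized = importances(X_std)

# Largest per-feature difference between the two importance diagrams.
max_deviation = (raw - standardized).abs().max()
```

Tree splits depend on the ordering of feature values rather than on their scale, which is why standardization is expected to leave the importances practically unchanged.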
The diagram (Fig. 1) shows the features that serve as input data for the model. To predict the MTBF using the System, the user must enter the values of these features and additionally indicate the year in which the pipeline section was put into operation.
The importance diagram shows adequate results:
Input and output pressure of a field pipeline section (P_in, P_out), as well as the flow rates of the gas and oil phases (F_gas, F_oil), are among the five most significant features.
Section length is also highly important, as, in general, the probability of a failure in a longer section is higher than in a shorter one.
Water content and temperature of pumped media are also highly important, as they directly influence the emergence and development of corrosion processes on a field pipeline inner wall.
The importance of outer and heat insulation coatings exceeds the importance of inner coating, since over 90% of the pipelines under consideration do not have inner coating, while over 50% of the field pipelines in operation use outer and heat insulation coatings.
The standardized value of operating pressure (P_op) provides additional support, since the averaged input and output pressure values cannot be fully trusted.
The influence of chemical treatment parameters is insignificant, because about 90% of the field pipeline sections are not treated with inhibitors.
When any of the features are excluded, the overall distribution of importance remains unchanged. Therefore, this diagram can also be used to determine the significance of operating factors.

Model creation and history matching
The prepared sample contains 19,623 examples of failures (rows) and 28 features (columns). At this stage, categorical features are coded with a binary method (one-hot encoding) [4], since numerical coding may lead the model to interpret categorical features incorrectly, for example as ordered values, which would ultimately affect the quality of the classifier performance.
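As an illustration of binary coding, the sketch below applies `pandas.get_dummies` to a small hypothetical table; the column names and values are assumptions, and the author may have used a different encoder.

```python
import pandas as pd

# Hypothetical fragment of the prepared sample.
df = pd.DataFrame({
    "type_of_soil": ["clay", "sand", "peat", "sand"],
    "section_length_km": [1.2, 0.4, 3.1, 2.0],
})

# Each soil type becomes its own 0/1 column, so no artificial order is implied.
encoded = pd.get_dummies(df, columns=["type_of_soil"], dtype=int)
```

With numerical coding, "clay", "peat" and "sand" would become 0, 1 and 2, and the model could treat "sand" as greater than "clay"; the binary columns avoid this.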
The predictive classifier is implemented with the RF algorithm, which is easy to use, provides good accuracy and is widely used in other research studies [5]. The Gini index is used as the splitting criterion [6].
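For reference, the Gini index of a node with class shares p_k is 1 - sum(p_k^2). The plain-Python sketch below shows how it scores a candidate split; the function names are hypothetical and the code is not taken from the paper.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini index of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


# A node with two equally represented classes has the maximum two-class impurity.
parent = [0, 0, 1, 1]
parent_gini = gini_impurity(parent)

# A split that separates the classes perfectly has zero impurity.
perfect_split = split_impurity([0, 0], [1, 1])
```

A decision tree chooses, at each node, the split with the lowest weighted impurity of the child nodes.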
The best model parameters are searched for using RandomizedSearchCV and GridSearchCV [4].
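The paper does not state how the two search tools were combined. One common arrangement, sketched below on synthetic data, is a coarse randomized search followed by a grid search around the best point found; the parameter ranges, the fold count and the data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for the prepared sample.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

base = RandomForestClassifier(criterion="gini", random_state=0)

# Stage 1: coarse randomized search over a wide (illustrative) parameter space.
coarse = RandomizedSearchCV(
    base,
    param_distributions={
        "n_estimators": [20, 40, 60, 80],
        "max_depth": [3, 5, 8, None],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
coarse.fit(X, y)

# Stage 2: exhaustive grid search in the neighbourhood of the coarse optimum.
best_n = coarse.best_params_["n_estimators"]
fine = GridSearchCV(
    base,
    param_grid={
        "n_estimators": [max(10, best_n - 10), best_n, best_n + 10],
        "max_depth": [coarse.best_params_["max_depth"]],
        "min_samples_leaf": [coarse.best_params_["min_samples_leaf"]],
    },
    cv=3,
    scoring="accuracy",
)
fine.fit(X, y)
```

Because the grid contains the best parameter set of the randomized stage, the second stage cannot score worse than the first under the same cross-validation folds.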

Classifier assessment
For quality assessment, the prepared data frame is divided into training and test sets with a ratio of 80% to 20%, respectively. The total number of examples of failures (19,623) is large enough to justify this strategy.
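A sketch of the 80/20 split with scikit-learn follows. The feature matrix and labels are placeholders with the sample size quoted above; the random seed and the absence of stratification are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features = 19623, 28        # sizes quoted in the text
X = np.zeros((n_samples, n_features))    # placeholder feature matrix
y = np.arange(n_samples) % 4             # placeholder class labels

# 80% of the rows go to training, 20% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With this sample size the split gives 15,698 training and 3,925 test examples. Passing `stratify=y` would preserve the class proportions in both sets; the paper does not say whether this was done.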
The quality of the classifier is assessed with the share of correct answers of the algorithm (accuracy), precision and recall, as well as the F1-score [7]. The values of the metrics are presented in table 1. The transition from interval classes to detailed classes is more convenient for perception and operation in the System, but it lowers the metric values; this information is provided in table 2. The confusion matrix (Fig. 3) is built for a detailed assessment of model performance quality (vs. interval classes). The confusion matrix shows large (over 10), medium-sized (from 5 to 10) and insignificant (under 5) groups of classification errors. The overwhelming majority (14 out of 16) of the large groups of errors fall in adjacent classes, which testifies to the similarity of the features of these failures and, consequently, to the similarity of the features of the field pipeline sections under study. A similar conclusion applies to the two other large groups of errors, which fall in close but not adjacent classes: 9...10 and 13...14; 17...18 and 23...25. It should be noted that combining some adjacent and closest classes would almost completely eliminate the erroneous answers of the classifier.
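The metrics and the adjacent-class analysis can be sketched as follows. The labels are a small made-up example with four ordered classes, and macro averaging is an assumption, since the paper does not state which averaging was used.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up true and predicted classes; classes are ordered by time to failure.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 1]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Errors are all off-diagonal entries; adjacent-class errors lie on the
# first diagonals above and below the main diagonal.
total_errors = cm.sum() - np.trace(cm)
adjacent_errors = np.trace(cm, offset=1) + np.trace(cm, offset=-1)
adjacent_share = adjacent_errors / total_errors
```

In this example two of the three errors fall in adjacent classes, which is the kind of pattern the confusion matrix in Fig. 3 is used to reveal.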

Summary
The intellectual analysis of statistical data has made it possible to develop a classifier that determines the mean time to failure for designed field pipeline sections based on historical operating data. Its satisfactory results are confirmed by the calculated quality assessment metrics. The confusion matrix indicates the similarity of features in adjacent and close classes. This problem may stem from the fact that over 80% of the data were excluded from the initial sample, as they contained missing values that could not be unambiguously identified and restored. Therefore, further research in this area will primarily focus on the restoration of missing values. In addition, a comparative analysis of other nonlinear classification algorithms will be performed. It should be noted that the good classifier quality metrics show that the current model can already be applied in the System software prototype now being developed. Besides, the current results can be applied to the sections of field pipelines currently in operation to optimize reconstruction and modernization schedules.