The Risk Quantification Evaluation Strategy for the Distribution Line based on EMLR

The electric power distribution grid directly serves ordinary users, and its stability and reliability are crucial to production and daily life. Traditional operation and maintenance of the distribution grid relies on experience and cannot scientifically predict line faults. In this paper, stepwise regression is used to screen out 8 key factors influencing line faults, and principal component analysis is performed to obtain the risk value of each line. Multiple linear and exponential regression equations are then established for the line risk value. The risk of each line is predicted from the existing data, and a GA-BP neural network model is built to predict the line risk value accurately.


Introduction
In the Chinese electric power grid, a scheduled overhaul mode has long been adopted for the maintenance and management of power infrastructure. Overall, it is a post-fault maintenance system. Once an overhaul starts, the affected lines cannot supply power for a long time, which seriously affects social life. Literature [1] proposes a distribution network fault monitoring method based on big data analysis, using the C-means fuzzy clustering algorithm to identify distribution line load patterns and perform load forecasting. Literature [2] introduces GIS technology to locate each tower and distribution transformer in real time, so as to safeguard the normal operation of 10kV distribution network lines. Literature [3] analyzes the sources and types of "dirty data" based on the characteristics of big data from distribution and consumption, and proposes a corresponding data cleaning method. The actual calculation results verify the effectiveness and accuracy of the prediction. Literature [4] analyzes and summarizes the current state of research on machine learning algorithms for processing big data, including the problems that machine learning research faces in the big data environment. Finally, it points out the research trends in big data machine learning.
In summary, there are currently few studies on predicting distribution grid line faults with big data technology, and a universal line fault prediction model is lacking. Therefore, this paper applies data mining to extract the key factors that may lead to line failures. In addition, a distribution grid line failure prediction model is established through standardization, principal component analysis, and a combination of multiple linear and non-linear regression algorithms.

Calculate risk value of the route
From two years of failure data collected on the 10kV distribution lines of a power company, 14 factors that may lead to distribution line faults are identified. Standardized stepwise regression is used to screen these factors. After the independent and dependent variables are standardized, stepwise regression is used to calculate the regression coefficient of each independent variable and its significance, so that variables with non-significant coefficients can be eliminated. The coefficient of an independent variable obtained after this regression is called the standardized regression coefficient. With the collected data, the fourteen factors are taken as independent variables and the number of line failures is taken as the dependent variable. Z-score standardization is then applied to the independent and dependent variables separately, and the coefficients are obtained through stepwise regression. The standardization formula is as follows:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the original value of a variable, $\mu$ is its mean and $\sigma$ is its standard deviation.
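The z-score standardization step can be illustrated with a minimal sketch. The function name `z_score` and the sample values are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption.

```python
from statistics import fmean, pstdev

def z_score(values):
    """Standardize a sequence to zero mean and unit standard deviation.

    Assumption: the population standard deviation is used.
    """
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]

# Hypothetical defect counts for five lines (illustrative values only).
defects = [2, 4, 4, 6, 9]
z = z_score(defects)
```

After this step every variable is dimensionless, so the regression coefficients of different factors can be compared directly.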

Failure cause factors
Not all of the 14 factors have a significant influence on the occurrence of failures. Only eight factors, such as the number of defects, pass the significance test. Although the other factors may also have some influence on the occurrence of faults, their influence is small enough to be ignored. The key factors are those that remain after the factors with less influence are removed.
Sorting the key factors by the absolute value of their standardized regression coefficients gives the importance ranking of the eight key factors. The number of defects has the greatest influence on the number of faults, and the insulation rate has the least. All factors are positively correlated with the number of line failures except the operating years, which is negatively correlated.
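The ranking step can be sketched as follows. The coefficient values below are hypothetical placeholders chosen only to match the qualitative statements above (number of defects largest, insulation rate smallest, operating years negative). They are not the paper's results, and only four of the eight factors are shown.

```python
# Hypothetical standardized regression coefficients (illustrative only).
coefficients = {
    "number_of_defects": 0.41,
    "unit_line_length": 0.22,
    "operating_years": -0.18,
    "insulation_rate": 0.05,
}

def rank_by_importance(coefs):
    """Sort factors by the absolute value of their standardized coefficient."""
    return sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)

ranking = rank_by_importance(coefficients)
for name, value in ranking:
    sign = "positive" if value > 0 else "negative"
    print(f"{name}: |coef| = {abs(value):.2f} ({sign} correlation)")
```

Sorting on the absolute value keeps a negatively correlated factor such as the operating years in its proper place in the ranking.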

Principal component analysis method
Assuming that there are n samples and p indicators, an n×p sample matrix x can be formed.
The correlation coefficient matrix of x is denoted R. According to the contribution rate of each principal component, the principal components can be sorted and a subset of them selected to achieve dimensionality reduction. Because the variables differ in dimension and unit, each variable is normalized to eliminate the influence of dimension. The normalization formula is given in (6). After the data are normalized, principal component analysis is performed to obtain the contribution rate of each principal component. According to the calculations, the cumulative contribution rate of the first 7 principal components reaches 96.21%, which makes it possible to reduce the calculation dimension, so only the first 7 principal components are used in subsequent calculations.
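Selecting components by cumulative contribution rate can be sketched as below. The eigenvalues and the 90% threshold are hypothetical, and the helper names are illustrative. The sketch assumes the eigenvalues of the correlation matrix R have already been computed.

```python
def contribution_rates(eigenvalues):
    """Contribution rate of each component: its eigenvalue over the total."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

def components_needed(eigenvalues, threshold):
    """Return (k, cumulative rate) for the smallest k reaching the threshold."""
    rates = contribution_rates(sorted(eigenvalues, reverse=True))
    cumulative = 0.0
    for k, rate in enumerate(rates, start=1):
        cumulative += rate
        if cumulative >= threshold:
            return k, cumulative
    return len(rates), cumulative

# Hypothetical eigenvalues for a 4-indicator correlation matrix.
eigenvalues = [2.0, 1.2, 0.6, 0.2]
k, cumulative = components_needed(eigenvalues, threshold=0.90)
```

With these illustrative eigenvalues the first two components explain 80% of the variance, so a third component is needed to pass the 90% threshold.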
After the value of each principal component is calculated, the contribution rates of the principal components are used as weights to construct a comprehensive evaluation function. This function gives the risk score of each line, and the comprehensive principal component score of each line is shown in Table 2. Each variable in the principal component analysis is a positive indicator, that is, the larger the variable value, the greater the risk of the line. The risk score of the line is calculated through the above steps, and likewise the larger the score, the greater the risk of the line. Here, $f(X_i)$ is the output value calculated by the model and $y_i$ is the actual value in the training set. In this paper, an iterative method is adopted to minimize the cost $F$.
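The comprehensive evaluation function can be sketched as a weighted sum of principal component scores. This is a minimal illustration in which the contribution rates are used directly as weights (an assumption; the weights could also be renormalized to sum to one). Line names, rates and scores are hypothetical.

```python
def composite_score(pc_scores, rates):
    """Weighted sum of principal component scores, weighted by contribution rate."""
    if len(pc_scores) != len(rates):
        raise ValueError("one contribution rate is required per component")
    return sum(w * s for w, s in zip(rates, pc_scores))

# Hypothetical contribution rates of the retained components.
rates = [0.5, 0.3, 0.15]

# Hypothetical principal component scores for two lines.
line_scores = {
    "line_A": [1.2, -0.4, 0.3],
    "line_B": [-0.8, 0.5, -0.1],
}

risk_scores = {name: composite_score(s, rates) for name, s in line_scores.items()}
```

Because principal component scores are centered, a composite score computed this way can be negative, as for `line_B` in this example.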

EMLR simulation and result calculation
Since the risk score calculated by principal component analysis can be negative, formula (10) is used to obtain the risk value. The risk_value is used as the actual value of the dependent variable in the regression equation.
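One common way to map scores that may be negative onto a non-negative scale is min-max rescaling. The sketch below assumes that form purely for illustration; the exact mapping in formula (10) may differ. The function name and the input scores are hypothetical.

```python
def min_max_rescale(scores):
    """Map scores linearly onto [0, 1].

    Assumption: min-max rescaling stands in for the paper's mapping,
    which may have a different form.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("all scores are identical; rescaling is undefined")
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical composite risk scores, some of them negative.
raw_scores = [-0.265, 0.525, 0.10, -0.02]
risk_value = min_max_rescale(raw_scores)
```

The rescaling preserves the ordering of the lines, so a line with a higher score still has a higher risk value.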
Here, $x_1$ is the unit line length, $x_2$ is the insulation rate, $x_3$ is the number of defects, $x_4$ is the degree of thunder, $x_5$ is the operating years, $x_6$ is the overload, $x_7$ is the surrounding trees, and $x_8$ indicates whether there are foreign objects, such as colored steel, nearby.
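Fitting a multiple linear regression by stochastic gradient descent can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes a squared-error cost, uses two hypothetical features instead of eight for brevity, and trains on synthetic noise-free data generated from known coefficients. The learning rate and epoch count are also assumptions.

```python
import random

def sgd_linear_regression(X, y, lr=0.05, epochs=500, seed=0):
    """Fit y ~ bias + sum(w_j * x_j) by per-sample gradient steps.

    Assumption: the per-sample cost is 0.5 * (prediction - target) ** 2.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    bias = 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order
        for i in indices:
            prediction = bias + sum(w * x for w, x in zip(weights, X[i]))
            error = prediction - y[i]
            # Gradient of the per-sample cost w.r.t. each weight is error * x_j.
            weights = [w - lr * error * x for w, x in zip(weights, X[i])]
            bias -= lr * error
    return weights, bias

# Synthetic standardized features (hypothetical values).
X = [[-1.0, 0.5], [0.0, -1.0], [1.0, 1.0], [0.5, -0.5], [-0.5, 0.0], [1.5, 0.5]]
true_w, true_b = [0.5, -0.3], 0.1
y = [true_b + sum(w * x for w, x in zip(true_w, row)) for row in X]

weights, bias = sgd_linear_regression(X, y)
```

An exponential regression could be trained with the same loop by changing how the prediction and its gradient are computed.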
The main purpose of calculating the risk value expression is to divide the risk value into intervals and obtain a table that relates each risk value interval to the number of line faults. A strong correlation between the risk value and the number of faults is therefore required.
To quantify the correlation between the risk value and the number of faults, the Spearman coefficient is introduced. The calculation formula of the two-variable Spearman coefficient is given in (12). After calculation, the Spearman coefficient between the risk value and the number of line faults is 0.4069, which indicates a weak correlation. In the expression of the risk value, the coefficient
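The Spearman coefficient can be computed as the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The sketch below follows that standard definition. The helper names and the sample data are hypothetical and are unrelated to the 0.4069 reported above.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical risk values and fault counts for five lines.
risk = [0.1, 0.4, 0.2, 0.8, 0.5]
faults = [0, 2, 1, 3, 1]
rho = spearman(risk, faults)
```

Because it depends only on ranks, the coefficient measures how monotonic the relationship between risk value and fault count is, not whether it is linear.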

Conclusion
In this paper, stepwise regression is first performed to screen the factors that affect line faults, which yields 8 key fault factors and their importance ranking. Then, through principal component analysis, the principal component scores of each line are obtained and regarded as the risk score of the line. Finally, the 8 key failure factors are used as independent variables, and the risk value obtained by normalizing the risk score is used as the dependent variable. The stochastic gradient descent method is used to train the parameters of the multiple linear and exponential regression equations and to obtain a quantitative expression for the risk value. The risk value of a line can therefore be predicted scientifically from the values of the key failure factors, which can guide the operation and maintenance of distribution grid lines.