Transmission line engineering cost prediction based on principal component analysis and least square support vector machine

The cost of transmission line engineering is affected by many factors that are not mutually independent, which makes it difficult to predict. First, principal component analysis is used to process the original indicator data, eliminating the correlation between the original indicators and extracting latent, mutually independent composite indicators. Then, the new indicators are used as the input set to construct a predictive learning model based on the least squares support vector machine, and the predicted output is compared with the actual values. The results show that the model achieves the desired prediction performance with small samples.


Introduction
The cost of transmission line engineering is a multivariable and highly nonlinear problem: many factors affect the cost of power transmission and transformation projects, and these factors are not independent of each other but interact in complex ways. This makes the cost of such projects difficult to predict. Traditional cost prediction methods include fuzzy mathematics and analogous engineering methods, but they are cumbersome and their prediction accuracy cannot meet the requirements of reasonable cost prediction [1]. With the application of mathematical methods and intelligent algorithms to forecasting, many new prediction methods have emerged, including classical methods represented by trend analysis and regression analysis, and intelligent methods represented by neural networks and support vector machines [2]. Classical prediction methods are easy to understand and operate, but they are severely limited when dealing with nonlinear problems. Intelligent prediction methods handle nonlinear problems well and are considerably more accurate than classical methods, but they are prone to over-fitting and to falling into local optima [3]. In this paper, principal component analysis is first adopted to reduce the dimensionality of the influencing factors, and the least squares support vector machine is then used for cost prediction.

Principal Component Analysis
Principal Component Analysis (PCA) is a dimension reduction algorithm commonly used in data mining. It was first proposed by Pearson in 1901 in his research on non-random variables, and in 1933 Hotelling extended the method from non-random variables to random vectors [4]. PCA is a multivariate statistical method that uses dimensionality reduction to convert multiple indicators into a few composite indicators while losing little information. The composite indicators generated by the transformation are called principal components. Each principal component is a linear combination of the original variables, and the principal components are uncorrelated with each other, which gives them better properties than the original variables.
The basic principle of principal component analysis is to regroup a set of correlated variables into a new set of mutually uncorrelated composite variables.
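The transformation described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure used in the paper: the indicators are standardized, the eigenvectors of their correlation matrix define the principal components, and the data are projected onto the leading components. The function name `pca_scores` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project standardized data onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    # Standardize each indicator to zero mean and unit sample variance
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the original indicators
    R = np.corrcoef(Z, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]  # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each score column is a linear combination of the standardized indicators
    scores = Z @ eigvecs[:, :n_components]
    return scores, eigvals

# Hypothetical data: 4 correlated indicators built from 2 latent factors
rng = np.random.default_rng(0)
base = rng.normal(size=(46, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=46),
    base[:, 1],
    base[:, 1] - base[:, 0],
])
scores, eigvals = pca_scores(X, 2)
```

The eigenvalues sum to the number of indicators because the correlation matrix has ones on its diagonal, and the resulting score columns are uncorrelated with each other, which is the property the prediction model relies on.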

Least Squares Support Vector Machine
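Support Vector Machine (SVM) is an algorithm based on statistical learning theory, proposed by Vapnik et al. in 1995. It is widely applied in feature extraction, pattern recognition and regression analysis. The support vector machine improves the generalization ability of the learning machine by minimizing the structural risk rather than the empirical risk used in traditional statistical learning, and in doing so minimizes both the empirical risk and the confidence range [5]. A significant advantage of the support vector machine is that it adapts well to small samples and handles the over-learning problem well. In 1999, Suykens proposed the Least Squares Support Vector Machine (LSSVM) on the basis of SVM. The main difference from SVM is that LSSVM uses a least squares objective function and replaces the inequality constraints of SVM with equality constraints, which further improves the accuracy of learning [6].

Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$, the LSSVM regression problem is

$$\min_{\omega, b, e} J(\omega, e) = \frac{1}{2}\omega^{T}\omega + \frac{\gamma}{2}\sum_{i=1}^{n} e_i^{2}, \quad \text{s.t. } y_i = \omega^{T}\varphi(x_i) + b + e_i,\ i = 1, \dots, n$$

where $\varphi(\cdot)$ is the nonlinear feature mapping, $e_i$ is the error variable and $\gamma$ is the regularization parameter. The original problem can be transformed into a dual problem through Lagrange duality, which makes it easy to solve. The Lagrange function is constructed so that the method extends to nonlinear problems through the introduction of kernel functions [7]:

$$L(\omega, b, e, \alpha) = J(\omega, e) - \sum_{i=1}^{n}\alpha_i\left[\omega^{T}\varphi(x_i) + b + e_i - y_i\right] \tag{2}$$

where $\alpha_i$ are the Lagrange multipliers. The optimal solution of equation (2) satisfies:

$$\frac{\partial L}{\partial \omega} = 0 \Rightarrow \omega = \sum_{i=1}^{n}\alpha_i\varphi(x_i), \quad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n}\alpha_i = 0, \quad \frac{\partial L}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma e_i, \quad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow \omega^{T}\varphi(x_i) + b + e_i - y_i = 0$$

Eliminating $\omega$ and $e$ gives a set of linear equations in $\alpha$ and $b$:

$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$$

In this formula, $\Omega_{ij} = \varphi(x_i)^{T}\varphi(x_j) = K(x_i, x_j)$ is the kernel matrix, $\mathbf{1} = [1, \dots, 1]^{T}$, $y = [y_1, \dots, y_n]^{T}$, $\alpha = [\alpha_1, \dots, \alpha_n]^{T}$, and $I$ is the identity matrix of order $n$. The regression function of the least squares support vector machine can then be expressed as:

$$f(x) = \sum_{i=1}^{n}\alpha_i K(x, x_i) + b$$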
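To make the linear system and the regression function concrete, the following is a minimal NumPy sketch of LSSVM regression in its standard textbook form. It is not the MATLAB toolbox used in the paper. The RBF kernel, the regularization parameter `gamma`, the kernel width `sigma`, the function names and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the (n+1) x (n+1) LSSVM linear system for b and alpha."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                      # first row: sum of alpha equals 0
    A[1:, 0] = 1.0                      # bias column
    A[1:, 1:] = K + np.eye(n) / gamma   # kernel matrix plus I / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]              # b, alpha

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy problem (hypothetical): recover a sine curve from 30 samples
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = np.sin(X).ravel()
b, alpha = lssvm_fit(X, y, gamma=100.0, sigma=1.0)
y_hat = lssvm_predict(X, alpha, b, X, sigma=1.0)
```

Because training reduces to a single call to a linear solver, there is no iterative optimization and no risk of a local optimum. In practice `gamma` and `sigma` would need to be tuned, for example by cross-validation on the training set.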

Transmission line engineering cost prediction model
Based on the actual conditions of transmission line engineering cost, this paper establishes a cost prediction model that combines principal component analysis with the least squares support vector machine. First, principal components are extracted from the raw indicator data to reduce dimensionality and remove correlation. The principal components then serve as the new input indicator set, and the sample data is divided into a training set and a test set. LSSVM is first trained on the training samples. The test samples are then fed into the trained model to obtain the predicted output, which is compared with the actual output to verify the validity of the model.

Sample indicator data selection
This paper selects the cost data of 46 110 kV overhead line projects in a certain area in 2018 as sample data to verify the cost prediction model. Because the number of indicators is large, they must first be screened and simplified. Sixteen indicators, such as line folding length, tower material quantity and unit length cost, were selected. The unit length cost is the dependent variable, and the other 15 indicators are the influencing factors and serve as the independent variables, as shown in Table 1. Among these 15 influencing factors, the number of line loops, the topographical distribution and the geological conditions are descriptive variables, so they need to be quantified. Each category is coded with a number (1, 2, 3, etc.), and a weighted average is then taken over the project as a whole.

According to the total variance explained table, the eigenvalues of the first five factors are greater than 1, and with five factors extracted the cumulative contribution rate reaches 85.143%, which explains more than 85% of the variance. The number of principal factors to extract is therefore set to five, and the same conclusion can be reached intuitively from Figure 1. After the principal component factors are determined, the original influencing factor data matrix is combined with the component score coefficient matrix (Table 3) to calculate the factor scores, which are used as the training and test data for the overhead line engineering cost prediction model in the next stage.
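The two selection rules applied here (eigenvalue greater than 1, and a cumulative contribution rate of at least 85%) can be sketched as follows. The eigenvalues in the example are hypothetical placeholders for 15 standardized indicators, not the values from the paper's total variance table, and `select_components` is an illustrative name.

```python
def select_components(eigenvalues, min_eigenvalue=1.0, min_cumulative=0.85):
    """Return the component counts suggested by the two common rules."""
    eig = sorted(eigenvalues, reverse=True)
    total = sum(eig)
    # Rule 1: keep components whose eigenvalue exceeds the threshold
    n_by_eigenvalue = sum(1 for v in eig if v > min_eigenvalue)
    # Rule 2: keep the smallest number of components whose cumulative
    # contribution rate reaches the threshold
    cumulative = []
    running = 0.0
    for v in eig:
        running += v
        cumulative.append(running / total)
    n_by_cumulative = next(
        i + 1 for i, c in enumerate(cumulative) if c >= min_cumulative
    )
    return n_by_eigenvalue, n_by_cumulative, cumulative

# Hypothetical eigenvalues of a 15 x 15 correlation matrix (they sum to 15)
eigenvalues = [4.8, 3.1, 2.2, 1.5, 1.2, 0.7, 0.5, 0.3,
               0.25, 0.15, 0.1, 0.08, 0.06, 0.04, 0.02]
n_eig, n_cum, cumulative = select_components(eigenvalues)
```

When the two rules agree, as they do for these placeholder values, the choice of the number of components is straightforward. When they disagree, a scree plot is a common additional check.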

LSSVM prediction
This paper uses the LSSVM toolbox for MATLAB to write the cost prediction program. First, the sample data obtained in Section 3.2 is divided into two groups: the first 36 sample projects form the training set, and the last 10 sample projects form the test set. Then, the training samples are input into the least squares support vector machine model for training. The fit on the training samples is shown in Figure 2. Finally, the test samples are input into the trained model to obtain the predicted results, and the prediction error rate relative to the actual values is calculated, as shown in Table 4 and Figure 3.

As can be seen from Table 4, the deviation between the predicted cost and the actual cost is less than 5% for each of the 10 test projects, and the mean absolute relative error is 4.03%. The allowed deviation between the cost estimate and the actual project settlement is 5%~10%, and the allowed deviation between the project budget and the actual project settlement is 3%~5%. The prediction results therefore show that the cost prediction model based on principal component analysis and the least squares support vector machine is highly accurate for overhead line projects.
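The error measures quoted here can be computed as in the sketch below. The cost figures are hypothetical and are not taken from Table 4. The function names are illustrative, and the definition of the mean absolute relative error (the average of the absolute per-project deviation rates) is an assumption about how the paper computes it.

```python
def relative_errors(actual, predicted):
    """Signed deviation rate of each prediction relative to the actual value."""
    return [(p - a) / a for a, p in zip(actual, predicted)]

def mean_absolute_relative_error(actual, predicted):
    """Average of the absolute deviation rates, expressed in percent."""
    errors = relative_errors(actual, predicted)
    return 100.0 * sum(abs(e) for e in errors) / len(errors)

# Hypothetical unit length costs for four test projects
actual = [100.0, 80.0, 120.0, 90.0]
predicted = [104.0, 78.0, 117.0, 93.6]

errors = relative_errors(actual, predicted)
mare = mean_absolute_relative_error(actual, predicted)
all_within_5_percent = all(abs(e) < 0.05 for e in errors)
```

Reporting both the per-project deviation rates and their average is useful because a low average can hide a single project with a large error.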

Conclusion
In this paper, principal component analysis and the least squares support vector machine are combined to construct a transmission line engineering cost prediction model, and the model is trained and tested on the cost indicator data of 110 kV overhead line projects in a certain area. The model fits the training data well, and the mean absolute relative error between the predicted and actual costs is only 4.03%, which indicates that the cost prediction model established in this paper is effective and accurate. With small samples, the cost of power transmission and transformation projects can be predicted accurately, which provides a useful reference for project budgeting. The model therefore has good application prospects.