Research on the Order Parameter Selection Algorithm Based on Correlation Analysis and Principal Component Analysis: Taking the Logistics Sector in Gansu Province as an Example

This paper designs an order parameter selection algorithm based on correlation analysis and principal component analysis. The design draws on statistical analysis methods, the selection principles for order parameters of social systems, the correlation test in correlation analysis, and the variable contribution test in principal component analysis. Redundant variables are first eliminated from the system by correlation analysis, and the variables with a high contribution to the system are then selected by principal component analysis, so the resulting order parameters have low information redundancy and reflect the actual information of the social system to the greatest extent. As an example, the algorithm is applied to panel data for the logistics sector in Gansu Province from 2006 to 2015. From an initial redundant set of sixteen indices, the algorithm extracts eight indices as the order parameters of the logistics sector in Gansu Province. In the rationality test, the selected order parameters reflect 99% of the original information. The results show that the algorithm can select the order parameters of a social system correctly and reasonably.


Introduction
Each phase of a system's development involves, to varying degrees, the transformation between quantitative change and qualitative change. From the point of view of orderly development, every phase of a system's development is driven by its internal order parameters. An order parameter is a key variable that leads the system to complete a structural transformation and thereby achieve a higher level of order; it reflects the total contribution of all internal factors to the cooperative motion of the system and exhibits the ordered structure and type of the whole system. Because of these characteristics, correctly selecting the order parameters greatly simplifies research on many problems concerning the system.
For a system, the importance of the order parameter is self-evident. During a phase transition, the internal variables of a system fall into two types: fast relaxing variables and slow relaxing variables. The system has many fast relaxing variables and few slow relaxing variables. The fast relaxing variables have little influence on the development of the system because they decay quickly, while the slow relaxing variables play a decisive role in the phase transition because they decay slowly, and they are therefore the key to guiding the development of the whole system. These slow relaxing variables are referred to as the order parameters. Order parameters take different forms in natural science and social science systems. In natural science, the order parameters may be certain particles or crystals with a special structure [1]. In social science, they may be certain subsystems or indices [2,3]. Correctly identifying the order parameters of a system is very important for understanding its self-organization behavior and cooperative motion.
In the field of natural science, the order parameters are usually identified experimentally, by repeatedly driving the system through its phase transformation according to the characteristics of the system itself. However, this method is often impractical in the field of social science. How to select and quantify the order parameters of a social system has therefore become the key to studying its orderly development. Existing approaches fall into three groups. Most researchers choose indices that comprehensively measure the social system as the order parameters [4][5][6]. Some researchers determine the order parameters by an exploration diagram method, in which all the factors influencing the cooperative motion of the system, together with their interactive and hierarchical relationships, are depicted in a diagram, based mainly on observation of the internal and external environment of the whole system, knowledge structure and information, full imagination, and consideration of larger environmental issues [7,3]. Others select the order parameters by a data mining algorithm, in which a reduction set is found from the information entropy of the index data over the years, and the indices whose removal does not change the information entropy of the system are eliminated [8,9]. Unlike these methods, this paper designs an order parameter selection algorithm based on correlation analysis and principal component analysis, drawing mainly on statistical analysis methods and self-organization theory. The algorithm takes low redundancy and high contribution of the order parameters as its core idea, and its rationality and applicability are verified by selecting order parameters for the logistics sector in Gansu Province. The algorithm is thus intended to support reasonable and objective selection of order parameters for a variety of social systems.

A. Principle of Order Parameter Selection
The order parameters should be selected according to definite principles. Because the forms of systems differ greatly between social science and natural science, the selection principles differ as well. This paper discusses the selection principles for order parameters in social systems. The specific principles are as follows:

1) Macro-representative. The order parameters describe the macroscopic behavior of the whole system. In synergetics, Hermann Haken holds that the order parameter reflects the ordered structure and type of the whole system [10]. Therefore, the order parameters are macro-representative, and their changes reflect the overall development of the system to a certain extent.

2) Systematic. As internal variables of the system, the order parameters influence one another. These influences are mapped onto the whole system as the total contribution of all the order parameters to the orderly motion of the whole system. Thus, the order parameters are systematic.

3) Quantifiable. The order parameters embody the ordered motion of the system in concentrated form, and studying them is the key to studying the development of the system. Thus, the order parameters should be quantifiable, which guarantees that research on the orderly development of the system is operable.

4) Simplified. The order parameters are slow relaxing variables. The system has few slow relaxing variables, but they play a leading role in the system; that is, a few dominate the majority. Thus, the order parameters should be selected from the slow relaxing variables in order to ensure simplicity.

5) Hierarchical. A complicated system often contains several subsystems, which in turn contain smaller subsystems, each with order parameters of different meanings, functions and forms. The contributions of these order parameters are summarized hierarchically in the whole complicated system. Thus, the order parameters are hierarchical.

B. Order Parameter Selection Algorithm
Two statistical analysis methods, correlation analysis and principal component analysis, are mainly used in this paper. On the one hand, correlation analysis describes the degree of correlation between variables, so it is used to eliminate redundant variables from the system state variables and thereby ensure the simplicity of the variable set. On the other hand, principal component analysis reduces the dimension of the data set while retaining as much of the variance as possible, which makes it possible to identify the variables with a high contribution. The two methods are introduced as follows:

1) Correlation Analysis. Assuming the system has a variable set containing n variables, the correlation coefficient between variables i and j is calculated by the following formula:

$$\rho_{i,j} = \frac{\operatorname{cov}(i,j)}{\sigma_i \sigma_j} \tag{1}$$

where $\operatorname{cov}(i,j)$ is the covariance between variables i and j, and $\sigma_i$, $\sigma_j$ are the standard deviations of variables i and j, respectively. The covariance and the standard deviations are determined by the data distribution of the variables. $\rho_{i,j}$ lies between -1 and 1. The closer $\rho_{i,j}$ is to 1, the higher the positive correlation; the closer $\rho_{i,j}$ is to -1, the higher the negative correlation; the closer $\rho_{i,j}$ is to 0, the lower the correlation. In most studies, 0.9 is set as the critical value for high correlation between variables [11], and the same critical value is used in this paper. Note that correlation analysis is a data analysis method, and the degree of correlation depends on the distribution of the data itself. Two variables may therefore be highly correlated in the data even though they are unrelated in actual meaning. In that case the actual meaning of the variables should be taken into account, and such variables should be retained.
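The correlation screening step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `high_correlation_pairs`, the use of the absolute value of the coefficient, and the synthetic data are not from the paper. The function only flags candidate pairs; deciding which flagged variable to drop, or whether to keep both because they differ in actual meaning, remains a manual judgment as described above.

```python
import numpy as np

def high_correlation_pairs(data, names, threshold=0.9):
    """Return (name_i, name_j, rho) for variable pairs with |rho| >= threshold.

    data: 2-D array, rows are observations (e.g. years), columns are variables.
    Assumption: strong negative correlation is also treated as redundancy.
    """
    rho = np.corrcoef(data, rowvar=False)  # Pearson correlation matrix, Formula (1)
    n = rho.shape[0]
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if abs(rho[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(rho[i, j])))
    return pairs

# Synthetic illustration (hypothetical values, not the Gansu data)
t = np.arange(10, dtype=float)
data = np.column_stack([
    t,                                # a: linear trend
    2.0 * t + 1.0,                    # b: exact linear function of a
    np.where(t % 2 == 0, 1.0, -1.0),  # c: alternating series, weakly related to a
])
pairs = high_correlation_pairs(data, ["a", "b", "c"])
print(pairs)
```

Here only the pair (a, b) is flagged, because b is an exact linear function of a.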

2) Principal Component Analysis. Assuming there are m variables, $X = (X_1, X_2, \cdots, X_m)$, in the system, the principal component model of the variables is as follows:

$$F_i = a_{i1}X_1 + a_{i2}X_2 + \cdots + a_{im}X_m, \quad i = 1, 2, \cdots, m \tag{2}$$

where $F_i$ represents the i-th principal component, $a_{ij}$ is the j-th component of the eigenvector (characteristic vector) corresponding to the i-th eigenvalue (characteristic value), $X_j$ is the observed value of the j-th variable, and m is the number of principal components. The i-th principal component is thus expressed as a linear combination of the variables X and reflects the information of the original variables [12]. The steps are as follows. The values of the variables are first normalized by z-score. The correlation coefficient matrix $R_{m \times m}$ of the m variables is then calculated, and its eigenvalues and eigenvectors are obtained from the characteristic equation:

$$|R - \lambda I| = 0 \tag{3}$$

The eigenvalues are ranked in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$, where $\lambda_i$ reflects the original information content represented by the i-th principal component, i.e. the part of the total variance of the original variables that it explains. The corresponding variance contribution is [12]:

$$\frac{\lambda_i}{\sum_{k=1}^{m} \lambda_k} \tag{4}$$

The variance contribution is the ratio of the original information content represented by the i-th principal component to the information content represented by all the principal components. Principal components are retained until the cumulative variance contribution is greater than or equal to 85%, and the principal components corresponding to the first p eigenvalues are selected to obtain the factor load $b_{ij}$ of the i-th variable on the j-th principal component as follows:

$$b_{ij} = a_{ji}\sqrt{\lambda_j} \tag{5}$$

where, in the notation of Formula (2), $a_{ji}$ is the i-th component of the eigenvector corresponding to the j-th eigenvalue. The contribution of a variable to the system can be determined from its factor load. A variable with a larger absolute factor load influences the system more significantly. A more simplified variable set is obtained by eliminating the variables with smaller factor loads.
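The steps of the principal component analysis can be sketched with NumPy as follows. This is a minimal sketch, not the SPSS procedure used in the paper: the function name `pca_factor_loadings`, the synthetic data, and the rule used to point at the weakest variable are illustrative assumptions. The signs of eigenvectors are arbitrary, so the factor loads should be compared by absolute value.

```python
import numpy as np

def pca_factor_loadings(data, cum_threshold=0.85):
    """PCA on the correlation matrix of the columns of `data`.

    data: 2-D array, rows are observations, columns are the m variables.
    Returns (contribution, p, loadings); loadings[i, j] is the factor load
    of variable i on principal component j for the first p components.
    """
    # z-score normalization of each variable (column)
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    r = np.corrcoef(z, rowvar=False)        # m x m correlation matrix R
    eigvals, eigvecs = np.linalg.eigh(r)    # R is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1]       # re-rank in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative round-off
    eigvecs = eigvecs[:, order]             # column j is the j-th eigenvector
    contribution = eigvals / eigvals.sum()  # variance contribution, Formula (4)
    cumulative = np.cumsum(contribution)
    # smallest p whose cumulative contribution reaches the threshold
    p = min(int(np.searchsorted(cumulative, cum_threshold)) + 1, len(eigvals))
    loadings = eigvecs[:, :p] * np.sqrt(eigvals[:p])  # factor loads, Formula (5)
    return contribution, p, loadings

# Synthetic illustration (hypothetical values, not the Gansu data)
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 1))
data = np.hstack([
    base + 0.05 * rng.normal(size=(10, 1)),  # two variables driven by the same factor
    base + 0.05 * rng.normal(size=(10, 1)),
    rng.normal(size=(10, 1)),                # an unrelated variable
])
contribution, p, loadings = pca_factor_loadings(data)
# One possible reading of "smaller factor load": the variable whose largest
# absolute load on the retained components is the smallest.
weakest = int(np.argmin(np.abs(loadings).max(axis=1)))
print(p, np.round(contribution, 3), weakest)
```

Because the correlation matrix has unit diagonal, the eigenvalues sum to m, so each contribution is simply the eigenvalue divided by the number of variables.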

3) Order Parameter Selection Algorithm Based on Correlation Analysis and Principal Component Analysis
When order parameters are selected for a social system, some studies simply take the common indices of an evaluation system as the order parameters, so the indices are very likely to be highly correlated. In addition, some indices may not reflect the actual development of the system well, which ultimately affects follow-up research.
To address these problems, an order parameter selection algorithm based on correlation analysis and principal component analysis is designed as follows. Starting from the redundant variable set chosen for the system, correlation analysis is carried out first and the redundant variables are eliminated to ensure simplicity. Principal component analysis is then applied to the remaining variables, and the variables with a high contribution to the system are selected by factor load to obtain the order parameters of the system. This method ensures that the selected order parameters have low redundancy and high contribution.
At the end of the selection, the rationality of the result must be checked to ensure that the selected order parameters fully reflect the development of the system. Rationality is judged by a variance ratio test: the information content of the selected order parameters should account for more than 85% of the information in the original redundant variable set [11]:

$$\mathit{In} = \frac{\operatorname{tr}S_s}{\operatorname{tr}S_o} \tag{6}$$

where $\operatorname{tr}S_s$ is the sum of the variances of the selected order parameters, $\operatorname{tr}S_o$ is the sum of the variances of the original redundant variable set, and their ratio $\mathit{In}$ represents the information contribution of the order parameters. The process flow of the order parameter selection algorithm based on correlation analysis and principal component analysis is shown in Fig. 1.
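The variance ratio test of Formula (6) can be illustrated as below. The sketch assumes that the variances are computed on the raw (unstandardized) data, which the paper does not state explicitly; on z-scored data every variance equals 1 and the ratio would reduce to the fraction of variables retained. The function name `information_contribution` and the numbers are hypothetical.

```python
import numpy as np

def information_contribution(data, selected_columns):
    """Sum of variances of the selected variables over that of all variables."""
    variances = np.var(data, axis=0, ddof=1)  # sample variance of each column
    return float(variances[selected_columns].sum() / variances.sum())

# Synthetic illustration (hypothetical values, not the Gansu data)
data = np.array([
    [1.0, 10.0, 0.1],
    [2.0, 20.0, 0.2],
    [3.0, 30.0, 0.1],
    [4.0, 40.0, 0.2],
])
ratio = information_contribution(data, [0, 1])  # keep the first two variables
passes = ratio > 0.85                           # rationality criterion from the text
print(round(ratio, 5), passes)
```

Under this assumption the ratio is dominated by variables measured on large scales, so a high value mainly indicates that the large-variance indices were retained.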

Fig. 1 Process flow diagram of the order parameter selection algorithm based on correlation analysis and principal component analysis

Selection of the Order Parameters from the Logistics Sector in Gansu Province
According to the relevant statistical indices of the logistics sector in the China Statistical Yearbook and published research on index systems for the logistics sector, the redundant variable set was determined mainly from four aspects: capital scale, demand scale, infrastructure and supply capacity, as shown in Table 1.

[Fig. 1 flowchart boxes: algorithm start (connector ①); input data of redundant variable set; remove the variables whose correlation coefficient is higher than 0.9 and whose actual meaning is related, and retain the remaining variables; principal component analysis of residual variables; retain principal components with cumulative contribution rate higher than 85%; eliminate the variables with smaller factor load in the principal component; decision (branches Y and N): is the cumulative information content of the remaining variables higher than 85%; obtain the order parameters of the system; algorithm termination.]

The panel data of Gansu Province from 2006 to 2015 were obtained from the website of the National Bureau of Statistics of the People's Republic of China. The order parameters of the logistics sector in Gansu Province were selected by the order parameter selection algorithm based on correlation analysis and principal component analysis. The data were analyzed and processed with SPSS software. Table 2 shows the correlation analysis results for the redundant variable set selected for Gansu Province. The following analysis can be made from Table 2.
a) The coefficient of correlation between the value added of the traffic, transport, storage and post sectors x1 and the turnover volume of freight transport x7 is 0.925, indicating a strong correlation. An increase in the turnover volume of freight transport goes together with an increase in the output value of the traffic, transport, storage and post sectors, so the two indices are highly correlated in actual meaning as well, and only one of them needs to be retained. The value added of the traffic, transport, storage and post sectors x1 is eliminated here.
b) The coefficient of correlation between total post business volume x2 and the express delivery volume x4 is 0.952, indicating a strong correlation and an overlapping, inclusive relationship between x2 and x4.
c) The coefficient of correlation between total post business volume x2 and fixed assets investment in the traffic, transport, storage and post sectors x3 is 0.912, indicating a strong correlation. Fixed assets investment in the post sector promotes the increase of total post business volume. Thus, the fixed assets investment in the traffic, transport, storage and post sectors x3 can be eliminated.
f) The total post business volume x2, possession of civil freight cars x15 and possession of private freight cars x16 are also highly correlated, but they are evaluation indices in different fields; they are therefore unrelated in actual meaning and should be retained. In the same way, the freight volume x6, length of postal routes x9 and the number of employees in the logistics sector x12 are unrelated in actual meaning and should be retained. The railway operating kilometers x10 and number of postal business outlets x8 are unrelated in actual meaning and should be retained.
Correlation analysis leaves the following indices for the logistics sector in Gansu Province: total post business volume, number of packages, freight volume, turnover volume of freight transport, number of postal business outlets, highway kilometers, the number of employees in the logistics sector, possession of civil freight cars, and length of postal routes. The next step of the algorithm is principal component analysis; the results and the component matrix are given in Tables 3 and 4. After principal components 1 and 2 are extracted, the cumulative contribution exceeds 85%. The factor loads in Table 4 show that the number of packages x5 has a small factor load on both principal components 1 and 2 and can therefore be eliminated first, while the other indices have larger factor loads and higher contributions to the system, each reflects an aspect of the development of the logistics sector, and they should be retained. The rationality test of Formula (6), applied after this primary selection, shows that the selected order parameters account for 99% of the information content of the originally selected index set, which meets the requirement of the rationality test. Finally, the following order parameters of the logistics sector in Gansu Province are determined: total post business volume, freight volume, turnover volume of freight transport, number of postal business outlets, highway kilometers, the number of employees in the logistics sector, possession of civil freight cars, and length of postal routes.
The development of the logistics sector in Gansu Province mainly relies on highways and railways. Because Gansu is a mountainous province, its logistics sector relies mainly on motor vehicles. Moreover, the post sector is a pillar of the logistics sector in Gansu Province. The order parameters selected from the post sector are consistent with the actual development of Gansu Province. Thus, the order parameters selected by this algorithm not only have high contribution and low redundancy, but also reflect the actual development of the system, which supports the rationality and applicability of the algorithm.

Conclusion
Correct selection of the order parameters is very important for research on the orderly development of a system. If the index system of an evaluation system is taken directly as the order parameters, follow-up research may deviate because of redundant indices and indices with a low contribution. Following the selection principles for order parameters of social systems, the order parameter selection algorithm based on correlation analysis and principal component analysis achieves both low redundancy among the order parameters and a high contribution of the order parameters to the system.