Study on pollution of different tributaries of Bai Ta Pu river based on principal component analysis

Based on the measured water quality indexes of the different tributaries of the Bai Ta Pu river, principal component analysis was used to calculate a comprehensive score for each tributary, weighting each principal component by its contribution rate. The tributaries were then ranked by score and the evaluation results were analyzed. The tributary with the highest comprehensive index is at Shi Jia Zhai bridge, and the one with the lowest comprehensive index is at De Sheng Tun bridge.


Principle of principal component analysis
Puckett et al. first applied principal component analysis to water quality evaluation in 1992, when they assessed the water quality of some rivers in Virginia [1][2][3][4]. Principal component analysis is a dimensionality reduction method that recombines the original, mutually correlated indexes into a group of new, uncorrelated comprehensive indexes.

The mathematical model of principal component analysis
$$
\begin{aligned}
F_1 &= a_{11} ZX_1 + a_{21} ZX_2 + \cdots + a_{p1} ZX_p \\
F_2 &= a_{12} ZX_1 + a_{22} ZX_2 + \cdots + a_{p2} ZX_p \\
&\;\;\vdots \\
F_m &= a_{1m} ZX_1 + a_{2m} ZX_2 + \cdots + a_{pm} ZX_p
\end{aligned}
$$

Here $a_{1i}, a_{2i}, \ldots, a_{pi}$ ($i = 1, \ldots, m$) are the components of the eigenvector of the covariance matrix $\Sigma$ of $X$ that corresponds to its $i$-th eigenvalue, and $ZX_1, ZX_2, \ldots, ZX_p$ are the standardized values of the original variables. In practice, indicators have different dimensions, so the original data should be standardized before calculation to remove the influence of dimension (data standardization in this paper refers to Z standardization). For standardized variables, $\Sigma$ coincides with the correlation coefficient matrix $R$. With $A = (a_{ij})_{p \times m}$, the eigenvectors satisfy $R a_i = \lambda_i a_i$, where $\lambda_i$ and $a_i$ are the eigenvalues and unit eigenvectors of $R$, respectively ($\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$).
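The model above can be illustrated with a minimal NumPy sketch. It is not the SPSS procedure used in this study: the function name `pca_from_correlation` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def pca_from_correlation(X):
    """Return eigenvalues (descending), unit eigenvectors, and standardized data."""
    X = np.asarray(X, dtype=float)
    # Z standardization: zero mean and unit sample standard deviation per column
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation coefficient matrix R of the original variables
    R = np.corrcoef(X, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]  # reorder so that lambda_1 >= lambda_2 >= ...
    return eigvals[order], eigvecs[:, order], Z

# Synthetic data (hypothetical): 20 samples, 5 indicators, two of them correlated
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))
X_demo[:, 1] += 0.8 * X_demo[:, 0]

lam, A, Z = pca_from_correlation(X_demo)
F = Z @ A  # column j holds F_j = a_1j*ZX_1 + a_2j*ZX_2 + ... + a_pj*ZX_p
```

In this sketch the eigenvalues sum to the number of indicators, and the sample variance of each score column equals the corresponding eigenvalue, which is why an eigenvalue can be read as the amount of variance a component explains.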

The basic steps of principal component analysis
The main steps are as follows [5][6][7]. First, indicators and data were selected according to the research question. Second, the indicator data were standardized into dimensionless data using the Statistical Package for the Social Sciences (SPSS). Third, the correlation between indicators was determined. Fourth, the number of principal components was determined. Fifth, the principal components Fi were expressed. Sixth, the principal components Fi were named. Finally, the principal components and the comprehensive principal component were evaluated [6][7][8].

Specific operation steps
The Factor procedure of the SPSS statistical analysis software was used to conduct the principal component analysis of the Bai Ta Pu river indicators. The specific operation steps were as follows. First, Analyze > Dimension Reduction > Factor was chosen to open the Factor Analysis dialog box. Second, X1~X11 were moved into the variable box. Third, Descriptives was clicked, the correlation matrix options were selected, and "Continue" was clicked to return to the Factor Analysis dialog box. Fourth, "OK" was clicked.
When SPSS runs the factor analysis procedure, the raw data are standardized automatically, so all variables in the results are standardized variables. However, SPSS does not output the standardized data directly. If the standardized data are needed, they can be calculated with the descriptive statistics procedure through the Analyze > Descriptive Statistics > Descriptives dialog box: move X1~X11 into the variable box, choose to save the standardized values as variables, and click "OK". The standardized data are then filled into the data window automatically, in new variables whose names start with Z.
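For readers without SPSS, Z standardization can be sketched in a few lines of NumPy. The values below are hypothetical, and the sketch assumes the sample standard deviation (n - 1 divisor) is used.

```python
import numpy as np

def z_standardize(X):
    """Column-wise Z standardization: subtract the mean, divide by the sample SD."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical values: 4 sampling points, 2 indicators measured in different units
X_demo = np.array([
    [7.2, 30.0],
    [7.8, 55.0],
    [8.1, 42.0],
    [7.5, 61.0],
])
Z_demo = z_standardize(X_demo)

# Changing the unit of an indicator (here a factor of 1000) leaves Z unchanged,
# which is what "eliminating the influence of dimension" means in practice.
Z_rescaled = z_standardize(X_demo * np.array([1.0, 1000.0]))
```

Each column of `Z_demo` has mean 0 and sample standard deviation 1, so indicators with large numerical ranges no longer dominate the analysis.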
Principal components were extracted according to the rule that the first m components with eigenvalues greater than 1 are retained. To some extent, the eigenvalue can be regarded as an indicator of how strongly a principal component influences the data. An eigenvalue smaller than 1 indicates that the explanatory power of the principal component is weaker than the average explanatory power of a single original variable. Table 1 is the variance decomposition table for principal component extraction. It can be seen from Table 2 that 4 principal components were extracted (i.e. m = 4). Table 2 is the initial factor loading matrix. CODcr, NH3-N, T-N, and T-P load highly on the first principal component, so the first principal component basically reflects the information of these four indicators. T, pH, SS, and DO load highly on the second principal component, so the second principal component basically reflects the information of these four indicators. NO3-N and NO2-N load highly on the third principal component, so the third principal component basically reflects the information of these two indicators. Acute toxicity inhibition loads highly on the fourth principal component, so the fourth principal component basically reflects the information of this indicator. Together, the four extracted principal components basically reflect the information of all indicators, so these 4 new variables were used to replace the 11 original ones. The expressions of these four new variables cannot be read directly from the output window, because each entry of the initial factor loading matrix is the correlation coefficient between a principal component and the corresponding variable, not an eigenvector coefficient.
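The retention rule can be checked against the four eigenvalues reported in this section (3.609, 2.430, 1.693, and 1.028). The sketch below assumes the analysis was run on the correlation matrix of the 11 standardized indicators, so that the total variance equals 11. The resulting contribution rates are illustrative and may differ from Table 1 by rounding.

```python
import numpy as np

p = 11  # number of standardized indicators (X1~X11)

# Eigenvalues of the four extracted components, as reported in the text
eigenvalues = np.array([3.609, 2.430, 1.693, 1.028])

# Retention rule: keep components whose eigenvalue is greater than 1
retained = eigenvalues[eigenvalues > 1.0]

# Assumption: total variance equals p, so the contribution rate is lambda_i / p
contribution = retained / p
cumulative = np.cumsum(contribution)

for i, (lam, c, cum) in enumerate(zip(retained, contribution, cumulative), start=1):
    print(f"F{i}: eigenvalue={lam:.3f}, contribution={c:.2%}, cumulative={cum:.2%}")
```

Under this assumption, an eigenvalue of 1 corresponds to a contribution rate of 1/11, the share carried by one original variable on average, which is the reasoning behind the threshold.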
To obtain the coefficient of each index in the four principal components (namely the eigenvectors), each column of the initial factor loading matrix was divided by the square root of the eigenvalue of the corresponding principal component. Let the four columns of the initial factor loading matrix be the variables B1, B2, B3, and B4. Then A1 = B1/SQR(3.609), where 3.609 is the first eigenvalue (for the second, third, and fourth principal components, the eigenvalues under SQR are 2.430, 1.693, and 1.028, respectively). This gives the eigenvector A1, and the eigenvectors A2, A3, and A4 are obtained in the same way. Multiplying the eigenvectors by the standardized data gives the principal component expressions. See Table 3.
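The conversion from loadings to eigenvectors, followed by scoring and ranking, can be sketched as follows. The data are synthetic and the number of retained components `m = 2` is hypothetical. Weighting each component by its share of the retained eigenvalues is one common way to build a comprehensive score and is an assumption here, not necessarily the exact weighting used in this study.

```python
import numpy as np

# Synthetic data (hypothetical): 12 sampling points, 6 indicators
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 6))
X[:, 1] += X[:, 0]
X[:, 3] += X[:, 2]

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # Z standardization
R = np.corrcoef(X, rowvar=False)                  # correlation coefficient matrix
lam, vec = np.linalg.eigh(R)
order = np.argsort(lam)[::-1]
lam, vec = lam[order], vec[:, order]

m = 2  # hypothetical number of retained components

# Loading matrix as a statistics package reports it: correlations between
# components and variables, i.e. eigenvector columns scaled by sqrt(eigenvalue)
B = vec[:, :m] * np.sqrt(lam[:m])

# Recover the eigenvectors: A_j = B_j / sqrt(lambda_j)
A = B / np.sqrt(lam[:m])

F = Z @ A                          # principal component scores per sampling point
weights = lam[:m] / lam[:m].sum()  # assumed weights: share of retained variance
composite = F @ weights            # comprehensive score per sampling point
ranking = np.argsort(composite)[::-1]  # indices from highest to lowest score
```

The recovered columns of `A` have unit length, which is a quick check that the division by the square root of the eigenvalue was applied to the correct column.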
This study comprehensively analyzed various indicators of the Bai Ta Pu river, including ammonia nitrogen, nitrate nitrogen, nitrite nitrogen, total phosphorus, chemical oxygen demand, total suspended solids, and acute toxicity [9][10][11][12]. Principal component analysis was used to reduce the river's indexes to a few comprehensive indexes and to evaluate the relative water quality of the sampling points on different tributaries, so as to provide a basis for understanding the distribution of water quality in the Bai Ta Pu river [13][14][15].
The tributary with the highest comprehensive index was at Shi Jia Zhai bridge, followed by the south canal, and the sampling point with the lowest comprehensive index was at De Sheng Tun bridge. This indicates that the pollution sources near Shi Jia Zhai bridge had a serious impact on the water system of the sub-watershed. The Bai Ta Pu river enters its middle reaches at Shi Jia Zhai village, in the transition zone from low hills to the Hun River flood plain. The terrain there is relatively flat, with an average elevation of about 50 meters. The composite water pollution index rose dramatically in the Shi Jia Zhai section, which is divided into a rural section and an urban section, and water pollution was more serious in the urban section than in the rural one. The pollution came from domestic sources, non-point sources, and effluent from sewage treatment plants [16]. Hu Jiguo also found that the poor water quality of the Bai Ta Pu river was mainly due to pollution from its tributaries. The Shi Jia Zhai tributary was seriously polluted and should be remediated to reduce the pollution flowing into the main stream [16].