An abnormal traffic detection method in smart substations based on coupling field extraction and DBSCAN

Smart substations have become more vulnerable to cyber attacks due to the high integration of information technologies, so it is essential to detect intrusion behaviour through abnormal traffic analysis. Although many detection methods for abnormal traffic exist, most of them focus on the format check of a single field of the industrial transmission protocol and ignore the deep coupling relationships among multiple protocol fields, which leads to false detections and missed detections. To overcome this problem and further improve the detection accuracy, we propose an abnormal traffic detection method based on coupling field extraction and density-based spatial clustering of applications with noise (DBSCAN). Correlation analysis is used to extract the coupling fields of the protocol, DBSCAN is used to remove the noise in the coupling fields, and the deep coupling relationship between the coupling fields is then mined by piecewise linear function fitting and used to detect abnormal traffic. The simulation results on 100,000 frames of traffic show that the proposed detection method can effectively identify the abnormal traffic.


Introduction
With the continuous coupling of communication networks and power systems, traditional power grids are rapidly transforming into cyber-physical power systems (CPPS) [1]-[2]. In the modern smart grid, the electricity supply network transmits electric energy, while the communication networks transmit control signals and related data. On the one hand, the power system has become more flexible because of the integration of communication technologies. On the other hand, since the communication network carries much important power system data, ensuring the security of the transmitted data has become a huge challenge. In other words, it is necessary to accurately detect abnormal traffic in the power system.
The smart substation is the core node of the smart grid. The accuracy of the data transmitted through the smart substation directly affects the safe and stable operation of the power grid. At present, the communication protocol between a smart substation and the control center is the International Electrotechnical Commission (IEC) 60870-5-104 protocol (denoted as IEC-104), which is vulnerable to attack because it lacks encryption and authentication mechanisms in the data transmission process [3]-[4]. Therefore, it is necessary to develop anomaly traffic detection for the IEC-104 protocol to improve the security level of the power system, which falls within the scope of this paper.
Considering the frequent occurrence of cyber events in recent years, several abnormal traffic detection methods have been proposed to mitigate the risk of cyber attacks. In ref. [5], a method based on principal component analysis (PCA) was proposed to detect modified data by separating the power flow variability into regular and irregular subspaces. A Kullback-Leibler distance (KLD) detection method was proposed in ref. [6]. The authors in [7] proposed an abnormal data detection method that explores the spatial and temporal inconsistency between normal data and abnormal data by using the discrete wavelet transform and a deep neural network. Besides, some detection methods based on machine learning have also been studied. In ref. [8], an abnormal traffic detection method based on an Elman neural network optimized by Particle Swarm Optimization (PSO) was proposed. A deep learning based long short-term memory method was proposed to detect abnormal traffic in [9]. More works about traffic detection can be found in [10]-[13].
However, most of the above detection methods focus on the general statistical characteristics of the traffic in substations, and only consider the format check of a single field of the IEC-104 protocol. The coupling relationships between the protocol fields are ignored. Even when machine learning methods are used for traffic detection, the detection rules of the protocol fields cannot be set up well, which leads to missed detections or false detections with a high probability.
To overcome the shortcomings of the existing detection methods, this paper proposes an abnormal traffic detection method based on coupling field extraction and DBSCAN. First, the coupling fields are automatically extracted by using correlation analysis. Then, the DBSCAN method is used to remove the noise points in the coupling fields. Finally, the deep relationships between the coupling fields are mined by using piecewise linear function fitting, and these relationships are used to detect the abnormal traffic.

The IEC 60870-5-104 protocol
The IEC-104 protocol relies mainly on TCP/IP, which inevitably introduces many network security problems into the smart substation. It is therefore necessary to carry out anomaly detection on the traffic data of the IEC-104 protocol. First, we analyze the protocol structure of IEC-104. The application layer of the IEC-104 protocol consists of a basic communication unit named the Application Protocol Data Unit (APDU), which can be further divided into the Application Protocol Control Information (APCI) and the Application Service Data Unit (ASDU) [14]. The detailed structure of the IEC-104 protocol is illustrated in Figure 1.
As shown in Figure 1, the IEC-104 protocol defines the start character (68H), the length of the APDU (at most 253) and the control fields. By defining different values of the control fields, the IEC-104 protocol formulates three APCI frame formats: I-format, S-format and U-format. In practice, the fields of an ASDU, as customized in the IEC-104 implementation used by the smart substation, are defined as follows: the type identification (TI) field represents the byte length of the data in the information object; the variable structure qualifier (VSQ) field represents the number of data items in the information object; and the cause of transmission (CT) field represents the command type corresponding to the frame message, such as sudden upload (03), activation (06), activation confirmation (07), response to the general call (20) and so on. The meaning of the other fields in the ASDU is straightforward and is not explained here.
However, the IEC-104 protocol lacks any encryption measures, which makes it vulnerable to cyber attacks such as DDoS attacks, false data injection attacks and man-in-the-middle attacks. Therefore, it is necessary to carry out anomaly detection on the data transmitted by the IEC-104 protocol.
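To make the frame layout concrete, the following minimal sketch parses the APCI header and, for I-format frames, reads the raw TI, VSQ and CT bytes. The names `parse_apdu` and `Apdu` are hypothetical and not from the paper; the bit masks follow the standard IEC 60870-5-104 layout and do not model the substation-specific customization of TI described above.

```python
from dataclasses import dataclass
from typing import Optional

START_BYTE = 0x68       # start character of every APDU (68H)
MAX_APDU_LENGTH = 253   # largest allowed value of the APDU length field


@dataclass
class Apdu:
    length: int                  # value of the APDU length field (AL)
    frame_format: str            # "I", "S" or "U"
    ti: Optional[int] = None     # raw type identification byte (I-format only)
    vsq: Optional[int] = None    # number of information objects (I-format only)
    ct: Optional[int] = None     # cause of transmission code (I-format only)


def parse_apdu(frame: bytes) -> Apdu:
    """Parse one APDU; raise ValueError if the basic format check fails."""
    if len(frame) < 6:
        raise ValueError("an APDU needs at least the 6-byte APCI")
    if frame[0] != START_BYTE:
        raise ValueError("missing start character 0x68")
    length = frame[1]
    if length < 4 or length > MAX_APDU_LENGTH:
        raise ValueError("APDU length out of range")
    if len(frame) != length + 2:
        raise ValueError("length field does not match the frame size")

    # The two lowest bits of the first control octet select the format.
    control = frame[2]
    if (control & 0x01) == 0:
        frame_format = "I"
    elif (control & 0x03) == 0x01:
        frame_format = "S"
    else:
        frame_format = "U"

    if frame_format != "I" or len(frame) < 9:
        return Apdu(length, frame_format)

    # The ASDU starts right after the APCI: TI, VSQ, then cause of transmission.
    return Apdu(
        length,
        frame_format,
        ti=frame[6],
        vsq=frame[7] & 0x7F,   # low 7 bits: number of objects
        ct=frame[8] & 0x3F,    # low 6 bits: cause code
    )
```

A per-field format check of this kind is what the existing methods rely on; the method proposed in this paper additionally checks the relationships between the extracted field values.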

The detection method based on coupling field extraction and DBSCAN
In this section, we propose an abnormal traffic detection method based on coupling field extraction and DBSCAN for smart substations (the IEC-104 protocol is taken as an example in this paper). As mentioned in Section I, the existing abnormal traffic detection methods are based partly on the payload transmitted in the protocol and partly on the statistical characteristics of the traffic, and both ignore some inherent features contained in each frame of the traffic itself. Consequently, if the data in the traffic is only slightly tampered with, detection methods based on these two principles cannot detect the attack. To overcome the disadvantages of the existing detection methods and discover the internal relationships among protocol fields, a coupling field extraction method based on correlation analysis is proposed. Then, the DBSCAN clustering algorithm is used to further reduce the noise in the data of the coupling fields.

The method of coupling field extraction
Since the master station stores a large amount of traffic data, it provides a basis for correlation analysis among protocol fields. By using correlation analysis, we can know the degree of correlation between two or more protocol fields, that is, when one field changes, the other field also changes. This phenomenon is more obvious in the protocol traffic of the smart substation. If the correlation between some fields is very prominent, they are defined as the coupling fields. Furthermore, the correlation of coupling fields is used to verify each protocol field to achieve the purpose of abnormal traffic detection.
The Pearson correlation coefficient is currently one of the most commonly used correlation analysis methods, but it can only be applied to continuous data, whereas the IEC-104 protocol also contains discrete data. This paper therefore adopts the Spearman correlation coefficient method. In addition, the Kendall correlation coefficient method is also employed to further improve the reliability of the correlation analysis and thus the accuracy of the abnormal traffic detection.
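I: Spearman correlation coefficient.
Assume that the datasets of two protocol fields are X and Y, each containing N elements. First, we sort X and Y (both in ascending or both in descending order) and obtain two new rank datasets x and y, where $x_i$ and $y_i$ represent the i-th element in x and y, respectively. Then, we can calculate the Spearman correlation coefficient by using the following equation:
$$\rho = 1 - \frac{6\sum_{i=1}^{N}(x_i - y_i)^2}{N(N^2 - 1)}$$
where $\rho$ is the value of the Spearman correlation coefficient.
II: Kendall correlation coefficient.
Assume that the datasets of two protocol fields are X and Y, each containing N elements, and that $X_i$ and $Y_i$ represent the i-th element in X and Y, respectively. For two pairs $(X_i, Y_i)$ and $(X_j, Y_j)$ with $i \neq j$, if $X_i > X_j$ and $Y_i > Y_j$, or $X_i < X_j$ and $Y_i < Y_j$, we denote them as consistent; if the two orders are opposite, we denote them as inconsistent; and if $X_i = X_j$ or $Y_i = Y_j$, we denote them as equivalent. Then, we can calculate the Kendall correlation coefficient by using the following equation:
$$\tau = \frac{A - B}{\sqrt{(N_3 - N_1)(N_3 - N_2)}}, \quad N_3 = \frac{N(N-1)}{2}, \quad N_1 = \sum_{i=1}^{s}\frac{U_i(U_i - 1)}{2}, \quad N_2 = \sum_{i=1}^{t}\frac{V_i(V_i - 1)}{2}$$
where $\tau$ is the value of the Kendall correlation coefficient. A represents the number of element pairs in X and Y that meet the consistency, and B represents the number of element pairs in X and Y that meet the inconsistency. $s$ and $t$ are the numbers of equivalent sets in X and Y, and $U_i$ and $V_i$ represent the number of elements in the i-th equivalent set of X and Y, respectively.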
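As a rough illustration of the two coefficients, the sketch below computes the Spearman coefficient from rank differences and the tie-corrected Kendall coefficient (tau-b) in plain Python. The function names are hypothetical, and two choices are assumptions rather than details from the paper: tied values receive average ranks, and `None` is returned for a constant field, for which no coefficient is defined.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(X, Y):
    """Spearman coefficient from the rank-difference formula."""
    n = len(X)
    if n < 2 or len(set(X)) == 1 or len(set(Y)) == 1:
        return None  # undefined for a constant field
    x, y = average_ranks(X), average_ranks(Y)
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau_b(X, Y):
    """Kendall coefficient with tie correction, O(N^2) pair comparison."""
    n = len(X)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = X[i] - X[j], Y[i] - Y[j]
            if dx == 0:
                tied_x += 1          # pair is equivalent in X
            if dy == 0:
                tied_y += 1          # pair is equivalent in Y
            if dx * dy > 0:
                concordant += 1      # consistent pair
            elif dx * dy < 0:
                discordant += 1      # inconsistent pair
    total = n * (n - 1) / 2
    denom = math.sqrt((total - tied_x) * (total - tied_y))
    if denom == 0:
        return None  # undefined for a constant field
    return (concordant - discordant) / denom
```

The quadratic pair loop is slow on long captures, which is one reason to work on smaller groups of frames or to switch to a library implementation for large datasets.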

The DBSCAN clustering algorithm
DBSCAN is a typical density-based clustering method that is well suited to finding noise points, which here represent the extreme scenarios of the coupling fields. The clustering model has two important parameters: Eps (the threshold of the neighborhood distance) and MinPts (the minimum number of samples in the neighborhood). DBSCAN automatically groups data according to its density and identifies the noise points that are regarded as extreme scenarios. As shown in Figure 2, the core idea of DBSCAN is to start from one point; if the point is a core point, its cluster is expanded according to the density-reachable rule, and all points are finally classified into core points, border points and noise points based on the rules in [15]. By using the DBSCAN method, we can find and remove the noise points in the data of the coupling fields. The selection of Eps and MinPts is given in the case study. The coupling field extraction and DBSCAN are only the basis of the abnormal traffic detection; a detection method is still needed to make reasonable use of the features obtained by these two parts, which is discussed in the next section.
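The expansion rule can be sketched in a few lines of plain Python. This is a minimal textbook version of DBSCAN with Euclidean distance, not the authors' implementation; the function name `dbscan` and the `NOISE` label are illustrative choices, and a point counts as its own neighbour.

```python
import math
from collections import deque

NOISE = -1  # label given to points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""

    def neighbours(i):
        # All points within distance eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE        # may later be relabelled as a border point
            continue
        cluster += 1                 # point i is a core point: open a new cluster
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # density-reachable border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point too: keep expanding
    return labels
```

For example, `dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], eps=1.5, min_pts=4)` puts the four neighbouring points in one cluster and labels the isolated point as noise.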

The abnormal traffic detection method
As mentioned above, after the processing of the coupling field extraction and abnormal coupling field elimination, a detection model is needed to measure the numerical relationship between coupled fields.
The whole process of the proposed abnormal traffic detection method is illustrated in Figure 3, and the detailed steps are summarized as follows:
Step-1: Collect historical traffic data and parse out the value corresponding to each protocol field.
Step-2: Calculate the correlation coefficient matrix of all protocol fields by using the Spearman correlation coefficient method and the Kendall correlation coefficient method. Define two protocol fields with a correlation coefficient not less than 0.6 as a coupling field, and extract all coupling fields according to the correlation coefficient matrix.
Step-3: Determine the noise points of the coupled fields of discrete data by using the DBSCAN method. It is worth noting that the DBSCAN clustering algorithm is only for the coupled fields of discrete data.
Step-4: For the coupled fields of continuous data, this paper uses piecewise linear function fitting to mine the numerical relationships. Next, we explain in detail how the piecewise linear function fitting is achieved.
1) Piecewise: Assume that the coupling fields extracted after DBSCAN are coupling fields A, B and C. The data in coupling fields A and B are continuous, and the data in coupling field C is discrete. Therefore, for each coupling field element $c_i$ in C, a fitting function $f_{c_i}(\cdot)$ of its corresponding coupling field elements $a_i$ and $b_i$ in A and B needs to be made. This process is defined as piecewise fitting in this paper.
2) Normal coupling field: As mentioned above, we need to perform function fitting for each coupling field element $c_i$ in C. If $c_i$ corresponds to a series of coupled data $(a_j, b_j), (a_{j+1}, b_{j+1}), \ldots, (a_{j+n}, b_{j+n})$ in A and B, we can easily fit the corresponding function. This process is defined as normal coupling field fitting in this paper, and the linear function fitting method based on the least squares method is as follows:
$$b_j = k_{c_i} a_j + d_{c_i} + e_j$$
where $e_j$ is the fitting error of sample $(a_j, b_j)$, and $k_{c_i}$ and $d_{c_i}$ are the constant coefficients of the fitting function.
Generally, the purpose of the least squares method is to minimize the squared loss function value corresponding to the fitting function, where the squared loss function $L_{c_i}$ is calculated as follows:
$$L_{c_i} = \sum_{j} e_j^2 = \sum_{j} \left(b_j - k_{c_i} a_j - d_{c_i}\right)^2$$
By performing the function fitting on the normal coupling field data, one can get the values of $k_{c_i}$ and $d_{c_i}$ and the corresponding fitting error, which will be used in the subsequent abnormal traffic detection.
3) Special coupling field: If $c_i$ corresponds to only a single pair of coupled data in A and B, it is difficult to determine a unique corresponding function. This process is defined as special coupling field fitting in this paper.
To avoid introducing additional piecewise functions, we first use the function coefficients obtained by the normal coupling field fitting to verify whether the special coupling field can be satisfied. If yes, the corresponding function coefficients are reused; if not, the linear fitting method above is used to refit and obtain the corresponding function coefficients.
Step-5: Extract the corresponding coupling field of the real-time traffic, and determine whether the coupling field is the noise points selected by the DBSCAN clustering algorithm. If yes, record it as the abnormal traffic; if not, go to step-6.
Step-6: Calculate the error according to the corresponding piecewise function coefficients. If the error of the real-time traffic exceeds the corresponding threshold value, record it as abnormal traffic; otherwise, record it as normal traffic.
Based on the above proposed abnormal traffic detection method, we can detect abnormal traffic that cannot be detected by the conventional methods, even if the protocol data has been only slightly tampered with. Unlike the existing methods that only perform abnormal detection on the general statistics of network traffic, the proposed method investigates the coupling relationships between protocol field data. By doing so, the proposed method improves the detection accuracy of abnormal traffic by reducing the rate of missed detections.
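Steps 5 and 6 can be sketched as a single check per frame. Everything named here is hypothetical: `detect`, the dictionary layout, the fitted form VSQ ≈ k·AL + d, and the coefficient values in the usage note are illustrative assumptions, not results from the paper. Only the noise point {100, 1, 3} is taken from the case study. Treating a {DT, DB, CT} combination with no fitted model as abnormal is also an assumption.

```python
def detect(frame, noise_points, models):
    """Return True if the frame is classified as abnormal traffic.

    frame        -- dict with the parsed fields DT, DB, CT, AL and VSQ
    noise_points -- set of (DT, DB, CT) tuples marked as noise by DBSCAN
    models       -- dict mapping (DT, DB, CT) to (k, d, max_error), the
                    coefficients and error threshold of one linear piece
    """
    key = (frame["DT"], frame["DB"], frame["CT"])

    # Step-5: the discrete coupling field must be a known normal point.
    if key in noise_points or key not in models:
        return True

    # Step-6: the continuous coupling field must satisfy the fitted line.
    k, d, max_error = models[key]
    error = abs(frame["VSQ"] - (k * frame["AL"] + d))
    return error > max_error
```

With the made-up model `{(13, 4, 3): (0.2, -2.0, 0.5)}`, a frame with AL = 20 and VSQ = 2 passes both checks, while the same frame with VSQ changed to 5 is flagged in Step-6 even though each field on its own is well formed.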

Case study
In this section, we use 100,000 frames of IEC-104 traffic data from a substation in a province of China to test the proposed abnormal traffic detection method. Simulations are carried out on a 3.9 GHz personal computer with 8 GB of RAM.

The result of coupling field extraction
First, we need to extract all the protocol fields of the IEC-104 protocol. A total of 10 protocol fields are extracted in this paper. APDU length (AL) represents the length of the APDU. Data type (DT) and data byte length (DB) are obtained from the type identification field: DT represents the type of data in the information object, and DB is the byte length of the data in the information object. Variable structure qualifier (VSQ) represents the number of data items in the information object, and cause of transmission (CT) represents the command type corresponding to the frame message. The remaining five protocol fields come from the quality descriptor of the information object, namely BL, SB, NT, IV and SPI; a detailed explanation can be found in [16].
In order to speed up the correlation analysis, we divide the 100,000 frames of traffic data into 10 groups. The correlation coefficient matrix of each group is calculated separately, and the average of the 10 correlation coefficient matrices is taken as the overall correlation coefficient matrix. The correlation coefficient matrices of the Spearman and Kendall methods are listed in Table I.
In Table I, the upper triangular part is the correlation coefficient matrix determined by the Spearman method, and the lower triangular part is the correlation coefficient matrix determined by the Kendall method. Based on Table I, we can see that the results of the Kendall and Spearman methods are roughly the same, which supports the correctness of the correlation results. Note that the correlation coefficients of protocol fields BL, SB and IV are all '-' because the data of these three protocol fields are constant. Besides, it can be observed that the protocol fields AL and VSQ are related, and that the protocol fields DT and DB, DT and CT, and DB and CT are also related. Therefore, we can get the coupling fields in the IEC-104 protocol, which are listed in Table II.
In coupling field 1, the data of AL and VSQ are continuous. In coupling field 2, the data of DT, DB and CT are discrete. In order to verify the effectiveness of the proposed abnormal traffic detection method based on the coupling fields, the above coupling fields in 20 frames of normal traffic are modified to produce abnormal frames for evaluating the accuracy of the proposed method.
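One way to turn a correlation matrix into coupling fields is to link every pair of fields whose coefficient is not less than 0.6 and collect the linked groups. The sketch below does this with a simple graph search. The function name, the grouping by connected components and the coefficient values in the usage note are assumptions for illustration; the values are placeholders, not the entries of Table I.

```python
def coupling_fields(fields, corr, threshold=0.6):
    """Group fields whose pairwise coefficient is not less than threshold.

    fields -- list of field names, in the order used for the output
    corr   -- dict mapping (field_a, field_b) to a coefficient, or to None
              when the coefficient is undefined (constant field, '-')
    """
    linked = {f: set() for f in fields}
    for (a, b), value in corr.items():
        if value is not None and value >= threshold:
            linked[a].add(b)
            linked[b].add(a)

    groups, seen = [], set()
    for f in fields:
        if f in seen or not linked[f]:
            continue  # already grouped, or not coupled to anything
        stack, group = [f], []
        seen.add(f)
        while stack:
            current = stack.pop()
            group.append(current)
            for other in linked[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        groups.append(sorted(group, key=fields.index))
    return groups
```

Given placeholder coefficients of 0.9 for the pairs (AL, VSQ), (DT, DB), (DT, CT) and (DB, CT) and low or undefined values elsewhere, the function returns `[["AL", "VSQ"], ["DT", "DB", "CT"]]`, which matches the two coupling fields described in the text.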

The result of DBSCAN
Considering the speed of the DBSCAN clustering, we first preprocess the 10 groups of traffic data. We extract only the non-repeated protocol field data and filter out the massive amount of completely identical protocol data. After the preprocessing, we get 516 sets of protocol data. Then, as mentioned in Section III.C, the DBSCAN clustering algorithm is used to eliminate the noise data in the discrete coupled field (coupling field 2 in this paper). After comparing multiple sets of the parameters Eps and MinPts, the clustering algorithm has the best performance when Eps = 1 and MinPts = 4, so the parameters Eps and MinPts are set to 1 and 4.
The results of the DBSCAN method are shown in Figure 4. The size of each point reflects its number of occurrences in the total 100,000 frames of traffic data, and the points with a small number of occurrences are screened out and recorded as noise points. According to Figure 4, there are 2 noise points, and the corresponding values of {DT, DB, CT} are {100, 1, 3} and {200, 1, 5}. The remaining normal coupling fields are used in the piecewise linear function fitting, and their corresponding values of {DT, DB, CT} are also given.
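The preprocessing step, keeping each distinct combination once while remembering how often it occurs, can be sketched with a counter. The function name and the dictionary layout of a parsed frame are assumptions for illustration.

```python
from collections import Counter


def unique_with_counts(frames, fields=("DT", "DB", "CT")):
    """Map each distinct field combination to its number of occurrences."""
    return Counter(tuple(frame[name] for name in fields) for frame in frames)
```

The keys of the returned counter are the distinct points passed to the clustering step, and the counts correspond to the point sizes in a plot such as Figure 4.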

The result of linear function fitting
Based on the normal coupling fields determined by DBSCAN, we extract the coupling field {AL, VSQ} corresponding to each coupling field {DT, DB, CT}, and use the least squares method to fit the linear function between protocol fields AL and VSQ. In order to reduce the influence of abnormal traffic on the function fitting results, we use an iterative fitting and verification method to obtain the final fitting function. For each coupling field {DT, DB, CT}, we first perform function fitting on the corresponding coupling field {AL, VSQ} and calculate the average error of the fitting result. We then delete the data in the coupling field {AL, VSQ} whose error is larger than the average error and refit. The iteration stops when no data is deleted or the average error is less than 1, and the final fitting function is recorded. The fitting function is linear in AL and VSQ, and the results of the linear function fitting are listed in Table III.
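The iterative fitting and verification loop can be sketched as follows. The function names are hypothetical, and two details are assumptions: the error of a sample is taken as the absolute residual, and the second value of each pair is fitted as a linear function of the first.

```python
def fit_line(samples):
    """Ordinary least-squares fit of b = k * a + d; returns (k, d)."""
    n = len(samples)
    mean_a = sum(a for a, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in samples)
    if var_a == 0:
        return 0.0, mean_b  # all a values equal: fall back to a constant
    cov = sum((a - mean_a) * (b - mean_b) for a, b in samples)
    k = cov / var_a
    return k, mean_b - k * mean_a


def iterative_fit(samples, stop_error=1.0):
    """Refit after dropping samples whose error exceeds the average error.

    Stops when the average error is below stop_error or when a pass
    deletes nothing. Returns (k, d, kept_samples).
    """
    samples = list(samples)
    while True:
        k, d = fit_line(samples)
        errors = [abs(b - (k * a + d)) for a, b in samples]
        mean_error = sum(errors) / len(errors)
        if mean_error < stop_error:
            break
        kept = [s for s, e in zip(samples, errors) if e <= mean_error]
        if len(kept) == len(samples):
            break  # nothing was deleted in this pass
        samples = kept
    return k, d, samples
```

For ten points on the line b = 2a + 1 plus one outlier (5, 100), the first pass removes the outlier and the second pass recovers k ≈ 2 and d ≈ 1 from the remaining ten points.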

The result of abnormal traffic detection
In this section, we use the proposed abnormal traffic detection method and the original DBSCAN method to detect the abnormal behavior in the 100,000 frames of traffic data. The detection results of the proposed method based on the fitting functions and of the DBSCAN method are listed in Tables IV and V, respectively. It can be seen from Table IV that the proposed method accurately detects all 20 frames of abnormal traffic. Among them, the coupling field {AL, VSQ} of 18 frames (black font) does not satisfy the corresponding fitting function, and the coupling field {DT, DB, CT} of the other 2 frames (red font) does not match the normal points determined by DBSCAN.
On the other hand, it can be seen from Table V that the original DBSCAN method cannot detect all the abnormal traffic. The original DBSCAN method detects a total of 6 frames of abnormal traffic, but 4 of these frames (black font) are false detections, so only 2 of the 20 abnormal frames are correctly identified. Therefore, the detection accuracy of the proposed detection method is 100%, while the accuracy of the original DBSCAN detection method is only 10% (2/20). The results indicate that the proposed method can effectively utilize the deep coupling relationships among the protocol fields to detect abnormal traffic, and thus overcomes the disadvantage of the existing detection methods that do not consider the protocol fields.