A Risk Prediction Model of Hard Landing Based on Random Forest Algorithm

Abstract: Landing safety is a prominent issue in civil aviation safety management. To fully mine the factors that influence hard landing from flight data and to predict the risk of hard landing effectively, the random forest algorithm is introduced. First, the factors influencing hard landing are analyzed qualitatively, and the model features are chosen from the flight data. Second, a quantitative method for analyzing feature importance based on the Gini index is given. Finally, because the hard landing dataset is class-imbalanced, the model is trained on data rebalanced with the SMOTE method; the random forest classifier is then built and verified with real flight data. The results show that the recall rate of the model is 85.50%. The model can help prevent hard landings and also offers a methodological reference for airlines applying data mining to improve flight event management.


INTRODUCTION
Risk management is the main part of the safety management system (SMS) in civil aviation, which advocates actively exploring and even predicting risks in the system to raise the level of safety management. Landing safety is a key part of civil aviation safety management. IATA statistics from 2015 to 2019 showed that the accident rate of aircraft in the final approach and landing phase was as high as 53%, and that hard landing accidents accounted for 23% of all accidents in the landing phase [1]. According to the Boeing servicing manual, a hard landing is a flight event in which the vertical acceleration of an aircraft exceeds the prescribed threshold [2]. It can overload the aircraft structure, especially the landing gear and wings, and in severe cases cause fatalities [3]. The prediction of hard landing risk is therefore of great significance to civil aviation safety management.
The Quick Access Recorder (QAR) records performance, environmental and operational parameters throughout the flight [4]. At present, hard landings are diagnosed mainly from QAR data, especially the vertical acceleration at the moment the main landing gear touches the ground. However, most airlines do not carry out research based on QAR data and therefore, to some extent, overlook risks that arise during flight. The Civil Aviation Administration of China (CAAC) has run the Flight Operations Quality Assurance (FOQA) program since 1997, under which all commercial aircraft of Chinese airlines must be fitted with a QAR or similar equipment. Practice has shown that QAR data can help airlines improve flight safety management and quality control.
Although many studies on landing safety [5][6][7][8][9][10][11] and flight event prediction models [12][13][14][15] have been carried out, few have analyzed hard landing accidents on the basis of QAR data. For example, Wang et al. [16] used a logistic model to show that the vertical load at touchdown is closely related to the touchdown attitude and configuration. Liu et al. [17] proposed an improved Bow-tie model and analyzed the causes and consequences of hard landing by statistical methods, providing a reference for the prevention and risk control of hard landings. These studies rely on traditional mathematical statistics and therefore suffer from poor efficiency and low prediction accuracy when mining QAR data. In recent years, machine learning has gradually been applied in civil aviation safety management to ensure flight safety, and some researchers have studied hard landing by mining QAR data with machine learning. For example, Chen et al. [18] preprocessed QAR data by correlation analysis and factor analysis and established an SVM model to predict hard landing. Qiao et al. [19] proposed a hard landing prediction method based on an RBF neural network with the K-means clustering algorithm. Chau et al. [20] developed a deep prediction model based on Long Short-Term Memory (LSTM) and predicted the risk of hard landing from QAR data.
Random forest is a popular machine learning algorithm that can be used to develop prediction models [21]. It constructs many classification and regression trees from randomly selected training datasets and makes forecasts by aggregating the results of the individual trees. Random forest therefore usually provides higher accuracy than a single decision tree while retaining some useful qualities of tree models (e.g., the ability to interpret relationships between predictors and outcome) [22]. In classification settings, random forest often provides higher prediction accuracy than other models [23]. One of its main benefits for predictive modeling is the ability to handle datasets with many features. Risk prediction of aircraft hard landing based on QAR data can therefore be treated as a binary classification problem and addressed with the random forest algorithm.
In this paper, the random forest algorithm is applied to the risk prediction of hard landing, and a method for analyzing the importance of the factors influencing hard landing is given. In addition, the imbalanced hard landing dataset is rebalanced with the SMOTE method before the random forest is trained to predict the risk. This paper can provide a methodological reference for airlines seeking to improve risk management through flight data mining.

Selection of features
The premise of risk prediction is to select, from the QAR data, the features that influence hard landing. With reference to the SHEL model [24], the features of hard landing are determined from the "Human-Equipment-Environment" perspective. First, the human factors mainly include the control of altitude and speed in the glide stage, the control column and throttle operation in the flare, and the flare time and final flare pitch angle. Second, for a given aircraft type, the equipment parameters that affect hard landing fall mainly into three aspects: attitude, speed, and weight. Finally, the environmental factors that may affect the risk of hard landing include airport elevation, atmospheric temperature, visibility, wind shear, icing and so on.

QAR Data Collection and Preprocessing
The QAR data in this study comprise 128 cases collected from the Boeing 737-800 fleet of a local airline. The original data is a CSV (comma-separated values) file with thousands of rows and columns, so Python programming functions in Microsoft Excel were applied to process it. According to the FOQA [25], the threshold for determining a hard landing for this aircraft type was set to 1.6 g in this study. Landing samples were extracted from the 128 cases at 1 Hz below 50 feet, giving 1401 samples in total, of which 23 were hard landing samples. In principle, the random forest algorithm is insensitive to the units and scale of the data, so the sorted data do not need to be normalized.
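As a minimal sketch of the labelling rule and of how imbalanced the resulting dataset is, the snippet below applies the 1.6 g threshold and computes the share of hard landing samples from the counts reported above. The function names and the strict "greater than" comparison are assumptions made for illustration; the source does not specify how boundary values are treated.

```python
HARD_LANDING_THRESHOLD_G = 1.6  # threshold stated in the text for this aircraft type


def label_hard_landing(vertical_g, threshold_g=HARD_LANDING_THRESHOLD_G):
    """Return 1 (hard landing) if the vertical acceleration exceeds the threshold, else 0.

    Hypothetical helper: the strict '>' comparison is an assumption.
    """
    return 1 if vertical_g > threshold_g else 0


def minority_share(n_minority, n_total):
    """Fraction of samples that belong to the minority (hard landing) class."""
    return n_minority / n_total


# Counts reported in the text: 23 hard landing samples out of 1401.
share = minority_share(23, 1401)
print(f"hard landing share: {share:.2%}")
```

With fewer than 2% of samples in the hard landing class, a classifier that always predicts "normal" would still look accurate, which is why the imbalance has to be treated before training.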

The Basics of Random Forest
The random forests classifier (RFC) uses the bootstrap method to form different training sets, trains a decision tree on each, and combines the trees by voting. The RFC algorithm steps are as follows:
(1) Generate the decision trees:
  1. Use the bootstrap method to randomly select k samples from the dataset to form the training set T;
  2. Grow a tree from the training set:
    a. Randomly select m attribute features from the n attribute features;
    b. Use the Gini index of the m attribute features as the criterion for choosing the best split at each node of a single tree;
    c. Repeat the previous step, splitting the dataset into binary branches until the tree nodes are pure.
(2) The set H formed by the k decision trees is called the random forest.
(3) Count the classification results of all decision trees in the random forest; the result of the vote is the final classification result of the random forest.
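The three ingredients of the steps above (bootstrap sampling, random feature selection, and majority voting) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, tree growing itself is omitted, and the bootstrap is assumed to draw as many indices as requested with replacement.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, n_draws, rng):
    """Draw n_draws sample indices with replacement.

    Indices never drawn are the out-of-bag (OOB) samples for this tree.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_draws)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices out of n_features for one split."""
    return sorted(rng.sample(range(n_features), m))


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(n_samples=10, n_draws=10, rng=rng)
features = random_feature_subset(n_features=9, m=3, rng=rng)
forest_label = majority_vote([1, 0, 1, 1, 0])  # five trees vote; class 1 wins 3 to 2
```

The out-of-bag indices returned by `bootstrap_indices` are the samples a given tree never saw during training, which is what makes an OOB error estimate possible without a separate test set.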

Processing method for imbalanced datasets
Existing research seldom considers the imbalance of hard landing samples in real QAR data, which directly affects the prediction accuracy of machine learning algorithms such as random forest. SMOTE (synthetic minority oversampling technique) is an oversampling technique that synthesizes minority class samples. It improves on random oversampling, which increases the minority class simply by copying samples and is therefore prone to over-fitting. The basic idea of SMOTE is to analyze the minority samples and add new samples synthesized from them to the dataset. The algorithm for expanding the hard landing samples is as follows:
(1) For each hard landing sample x, calculate the Euclidean distance to all other samples in the minority sample set and obtain its k nearest neighbors.
(2) Set the sampling ratio n according to the imbalance proportion of the hard landing samples. For each minority sample x, randomly select several samples from its k nearest neighbors; denote a selected neighbor by x_n.
(3) For each randomly selected neighbor x_n, construct a new sample from it and the original sample according to formula (1):
$$x_{new} = x + \mathrm{rand}(0,1) \times (x_n - x) \qquad (1)$$
where rand(0,1) is a random number between 0 and 1.
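The SMOTE steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: each synthetic sample is taken as a random point on the segment between a minority sample and one of its k nearest minority neighbors, and the names `k_nearest`, `smote` and `n_per_sample` are hypothetical.

```python
import math
import random


def k_nearest(x, minority, k):
    """Return the k minority samples closest to x (Euclidean distance), excluding x itself."""
    others = [s for s in minority if s is not x]
    return sorted(others, key=lambda s: math.dist(x, s))[:k]


def smote(minority, n_per_sample, k, rng):
    """Create n_per_sample synthetic samples for every minority sample."""
    synthetic = []
    for x in minority:
        neighbours = k_nearest(x, minority, k)
        for _ in range(n_per_sample):
            x_n = rng.choice(neighbours)  # randomly selected neighbour
            gap = rng.random()            # random number in [0, 1)
            # Interpolate between x and x_n, coordinate by coordinate.
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, x_n)])
    return synthetic


# Toy minority class with three two-dimensional samples (illustrative values only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote(minority, n_per_sample=2, k=2, rng=random.Random(42))
```

Because every synthetic point lies between two existing minority samples, the new samples stay inside the region the minority class already occupies instead of duplicating existing points exactly.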

Importance analysis of features
To describe the importance of hard landing features quantitatively, the average change of the Gini index is used to measure feature importance for the RFC. The larger the average decrease in the Gini index attributable to a feature, the higher the importance of that feature. VIM denotes the variable importance measure, and GI denotes the Gini index. Suppose there are c features X_1, X_2, …, X_c. We need to calculate the Gini index score VIM_j of each feature X_j, that is, the average change of node splitting purity over all decision trees of the RFC. The Gini index is calculated as shown in formula (2):
$$GI_m = 1 - \sum_{k=1}^{K} P_{mk}^2 \qquad (2)$$
where K is the number of categories and P_mk is the proportion of category k in node m.
The importance of feature X_j at node m, that is, the change of the Gini index before and after node m splits, is shown in formula (3):
$$VIM_{jm} = GI_m - GI_l - GI_r \qquad (3)$$
where GI_l and GI_r are the Gini indices of the two new nodes after node m splits. If the nodes at which feature X_j appears in decision tree i form the set M, then the importance of X_j in tree i is shown in formula (4):
$$VIM_{ij} = \sum_{m \in M} VIM_{jm} \qquad (4)$$
With these scores, the c features can be ranked by importance to help airlines control the risk of hard landing purposefully.
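The node-level calculation described above can be illustrated with a short sketch. It assumes the Gini index of a node is one minus the sum of squared class proportions, and that the importance at a node is the parent's Gini index minus the Gini indices of the two child nodes. The optional `weighted` flag, which scales each child by its share of the parent's samples as many library implementations do, is an addition for comparison and is not taken from the text; all function names are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def node_importance(parent, left, right, weighted=False):
    """Change in Gini index when a node splits into a left and a right child.

    weighted=True scales each child by its share of the parent's samples
    (an assumption added for comparison, not stated in the text).
    """
    if weighted:
        n = len(parent)
        children = (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)
    else:
        children = gini_index(left) + gini_index(right)
    return gini_index(parent) - children


def tree_importance(splits):
    """Sum the node importances over all nodes of one tree that split on the same feature."""
    return sum(node_importance(parent, left, right) for parent, left, right in splits)


# A perfectly separating split: a mixed parent becomes two pure children.
parent, left, right = [0, 0, 1, 1], [0, 0], [1, 1]
delta = node_importance(parent, left, right)
```

For this toy split the parent has a Gini index of 0.5 and both children are pure, so the whole 0.5 is credited to the splitting feature.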

Calculation of model parameters
The random forest algorithm requires the optimal combination of two parameters: the number of decision trees (n_tree) in the forest and the number of features (n_try) drawn from all features for splitting when each subtree is grown. At the same time, Breiman's research shows that random forest often obtains good results under the default parameter settings [21].
The parameter n_tree is the number of trees in the random forest. As n_tree increases, the generalization error of the random forest converges to a limit, so adding trees does not cause over-fitting. The goal is therefore to choose an n_tree large enough for the OOB (out-of-bag) error to stabilize, but no larger, because too many trees lengthen the training time of the model. First, n_try is set to the square root of the number of features; then random forest models are built with different values of n_tree; finally, the trend of the OOB error across these models is observed to determine the value of n_tree.
The parameter n_try is the number of features selected at random each time a tree splits. Random feature selection makes the trees differ more from one another and improves the noise tolerance and generalization ability of the model, so the choice of this parameter is very important. With n_tree fixed, the optimal n_try is found by enumeration as the value that minimizes the OOB error.
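The two-stage selection described above can be sketched as two small helpers that operate on already measured OOB errors. Everything here is illustrative: the function names, the stability tolerance `tol`, and the error values in the example are invented for demonstration and are not results from this study.

```python
def smallest_stable_ntree(oob_error_by_ntree, tol=0.002):
    """Return the smallest n_tree from which the OOB error stays within tol of its final value.

    oob_error_by_ntree maps each candidate n_tree to its measured OOB error.
    The tolerance-based notion of 'stable' is an assumption of this sketch.
    """
    sizes = sorted(oob_error_by_ntree)
    final_error = oob_error_by_ntree[sizes[-1]]
    for i, n_tree in enumerate(sizes):
        if all(abs(oob_error_by_ntree[s] - final_error) <= tol for s in sizes[i:]):
            return n_tree


def best_ntry(oob_error_by_ntry):
    """Return the n_try with the smallest OOB error (enumeration over candidates)."""
    return min(oob_error_by_ntry, key=oob_error_by_ntry.get)


# Made-up OOB errors, for illustration only.
toy_ntree_errors = {10: 0.060, 20: 0.041, 40: 0.031, 80: 0.030, 160: 0.030}
toy_ntry_errors = {1: 0.045, 2: 0.030, 4: 0.034, 8: 0.039}

chosen_ntree = smallest_stable_ntree(toy_ntree_errors)
chosen_ntry = best_ntry(toy_ntry_errors)
```

In practice the error values would come from fitted forests; one possible source, named here only as an example, is `1 - oob_score_` from scikit-learn's `RandomForestClassifier` trained with `oob_score=True`.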

Training and verification of the model
Using the QAR dataset balanced by SMOTE, the optimal n_tree and n_try are obtained and the RFC is established. Breiman has shown that verifying the model with the OOB error is as accurate as using a test dataset [26]. The precision P, recall R and comprehensive evaluation index F_1 of the RFC are calculated to verify the usability of the model, as shown in formula (7):
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2PR}{P + R} \qquad (7)$$
where TP is the number of positive samples predicted as positive, FP is the number of negative samples predicted as positive, and FN is the number of positive samples predicted as negative. P is the proportion of samples classified as hard landing that really are hard landings; R is the proportion of actual hard landing samples that are classified correctly; F_1 is the comprehensive evaluation index of the model. The higher P, R and F_1 are, the more effective the RFC is.
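A short sketch of how precision, recall and F_1 follow from the confusion counts is given below, using the standard definitions with hard landing as the positive class (label 1). The function names and the toy label vectors are assumptions for illustration and are not data from this study.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Compute P, R and F1; each is defined as 0.0 when its denominator is zero."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Toy labels: 1 = hard landing, 0 = normal landing (illustrative only).
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
tp, fp, fn = confusion_counts(y_true, y_pred)
p, r, f1 = precision_recall_f1(tp, fp, fn)
```

For a safety application, recall is usually the figure to watch, since a missed hard landing (a false negative) tends to be more costly than a false alarm.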

EXPERIMENTAL ANALYSIS
In the balanced dataset obtained with the SMOTE method, normal landing and hard landing samples accounted for 52.39% and 47.60% of the samples, respectively.
The importance of features calculated according to equation (6) is shown in Figure 1. The results show that vertical speed, pitch angle, control column position, gross weight and pitch rate have a great influence on the hard landing risk. Airlines can control the risk through these five aspects, for example by improving flight training methods, and reduce it to an acceptable level.
The preset parameter n_try of the model was the square root of the number of features, that is, n_try = 3. The RFC was established with different values of n_tree, and the OOB error of each is shown in Figure 2. When n_tree = 500, the OOB error of the model tends to be stable. On this basis, the change of model accuracy under different n_try values is shown in Figure 3. When n_try = 3, the OOB error of the model is the smallest. The optimal parameters of the model are therefore n_tree = 500 and n_try = 3.
[Figure caption: The relationship between n_tree and OOB error]
Taking the real landing process of a certain flight as an example, the predicted hard landing risk categories are listed in Table 3. Hard landing risk is marked as "high" and normal landing risk as "low". After the aircraft passed 50 feet, the hard landing risk was "low"; in the 6th second it turned to "high", and the warning lasted for two seconds until the aircraft touched the ground. Compared with the real results, the risk prediction for this flight is misjudged in the 7th second, and the overall prediction performance of the model is acceptable.

DISCUSSION AND CONCLUSION
To predict the risk of hard landing scientifically, a prediction model based on the random forest classifier was constructed, covering feature selection, the treatment of imbalanced datasets, feature importance analysis, and model construction. The model was trained and validated on real QAR data. The results showed that the recall rate of the model was 85.50%, which had an acceptable ability to 11.5