Research status of damage identification algorithm based on deep learning

Target detection is one of the core tasks of computer vision. With the rapid development of deep learning, target detection based on deep learning has become the mainstream approach in this field. As one of its main application areas, damage identification has made important progress over the past decade. This paper systematically summarizes the research progress of deep-learning-based damage identification algorithms, analyzes the advantages and disadvantages of each algorithm, and compares their detection results on the PASCAL VOC 2007, VOC 2012 and COCO datasets. Finally, the main contents of the paper are summarized and the research prospects of deep-learning-based damage identification are discussed.


Introduction
Damage identification aims to locate and identify damaged targets in an image. However, owing to background noise, the morphological diversity of targets, occlusion, illumination intensity and image resolution, completing the target detection task successfully remains a great challenge [1]. To overcome these difficulties, many scholars have devoted themselves to research in this field.
The traditional target recognition pipeline consists of four stages: 1) candidate region selection, generating candidate boxes on the target image with a fixed-size sliding window; 2) feature extraction, describing each candidate region with hand-crafted features such as SIFT (scale-invariant feature transform) [2][3] and HOG (histogram of oriented gradients) [4][5]; 3) feature classification, using classifiers such as DPM (deformable part model) [6], AdaBoost [7] and SVM (support vector machine) [8] [9]; 4) post-processing, refining the results with methods such as NMS (non-maximum suppression) [10]. However, the traditional pipeline has clear drawbacks: the sliding window is not adapted to the characteristics of the target, which incurs a large time cost; the hand-designed features only work well when background and illumination vary little, and are unsuitable for tasks with large variations; and the traditional classifiers are easily disturbed by noise and lack robustness. It is therefore difficult for traditional methods to meet practical target detection requirements.
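The NMS step in stage 4 can be sketched as a simple greedy procedure (a minimal illustration in plain Python, not any specific library's implementation; the 0.5 overlap threshold is a common but arbitrary choice):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box that overlaps it above iou_thresh, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

For example, two heavily overlapping boxes with scores 0.9 and 0.8 collapse to the single higher-scoring one, while a distant third box survives.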
In 2012, the AlexNet model [11] proposed by Hinton et al. won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), outperforming the runner-up by more than 10%. Since then, deep learning has attracted increasing attention in the field of computer vision [12], driving its rapid development, and improved algorithm models have emerged continuously. Deep learning transforms input data into higher-level features through nonlinear models and has strong learning and generalization abilities. This progress has also advanced object detection, and deep-learning-based object detection has become one of the key research directions in computer vision. According to how they are implemented, deep-learning-based damage target detection algorithms can be divided into two schools, two-stage and one-stage detection, corresponding to damage target detection based on candidate region classification and damage target detection based on regression, respectively. This paper systematically introduces the research progress of these two families of algorithms, analyzes the related methods, and summarizes and looks ahead to research on deep-learning-based damage target detection.

DAMAGE TARGET DETECTION ALGORITHM BASED ON CANDIDATE REGION CLASSIFICATION
Algorithms of this type first select candidate regions from the input image, and then use a convolutional neural network to extract features from the candidate regions and calibrate their positions. The main representatives are R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, R-FCN, Mask R-CNN, IoU-Net and D2Det.

R-CNN
In 2014, Girshick et al. proposed the R-CNN model [13]. To address window redundancy, the model replaces the sliding window with a selective search algorithm [14], scales each candidate region to a fixed-size image, extracts features with a convolutional neural network, classifies the target class of the candidate region with an SVM classifier, and feeds the features to a fully connected network for position calibration. Replacing the traditional hand-designed feature extraction with a convolutional neural network allows image features to be extracted more effectively. However, because R-CNN must extract features from every candidate region separately, detection is computationally expensive. In addition, scaling the candidate regions loses some useful information in the image, which affects the final detection results.

SPP-Net
To address the shortcomings of R-CNN, He et al. proposed SPP-Net [15] in 2015. The network runs the convolutional layers only once over the whole image to obtain a feature map, which avoids the repeated feature extraction of R-CNN and greatly reduces computation time. A fixed-length feature vector is then extracted from the feature map by the SPP (spatial pyramid pooling) layer. Compared with R-CNN, detection is 24-64 times faster; at the same time, because of the SPP layer, there is no need to rescale the image, which reduces the loss of image information. However, in SPP-Net each training step is still independent, and the training cycle remains long.
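The defining property of the SPP layer, a fixed-length output regardless of input size, can be checked with a small calculation (the (1, 2, 4) pyramid here is an illustrative choice, not necessarily the configuration used in the paper):

```python
def spp_output_length(channels, levels=(1, 2, 4)):
    """Length of the fixed-size vector produced by spatial pyramid pooling.

    Each pyramid level with an n x n grid max-pools the feature map into
    n * n bins per channel, so the vector length depends only on the
    channel count and pyramid shape, never on the input feature-map size."""
    return channels * sum(n * n for n in levels)
```

A 256-channel feature map with this pyramid always yields a 256 * (1 + 4 + 16) = 5376-dimensional vector, whatever the image resolution was.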

Fast R-CNN
In 2015, Girshick et al. proposed the Fast R-CNN model [16], which refined SPP into the RoI (region of interest) pooling layer and introduced multi-task learning to unify multiple steps into a single convolutional neural network, greatly improving detection speed and accuracy.
Compared with SPP-Net, Fast R-CNN significantly improves both speed and accuracy, but the algorithm still relies on selective search to extract candidate boxes, so its detection speed cannot yet meet real-time requirements.
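The RoI pooling idea can be sketched for a single feature channel as follows (a minimal illustration with integer bin boundaries; real implementations handle fractional boundaries and very small RoIs more carefully):

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Minimal RoI max-pooling on a 2-D single-channel feature map.

    feature_map: list of lists (H x W); roi: (x1, y1, x2, y2) in feature
    coordinates; out_size: side of the fixed output grid. Each output bin
    max-pools a roughly equal slice of the RoI, so any RoI maps to an
    out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    out = [[float("-inf")] * out_size for _ in range(out_size)]
    for by in range(out_size):
        for bx in range(out_size):
            for y in range(y1 + by * h // out_size, y1 + (by + 1) * h // out_size):
                for x in range(x1 + bx * w // out_size, x1 + (bx + 1) * w // out_size):
                    out[by][bx] = max(out[by][bx], feature_map[y][x])
    return out
```

A 4 x 4 region pooled to 2 x 2 keeps only the maximum of each quadrant, giving a fixed-size input to the downstream fully connected layers.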

Faster R-CNN
To eliminate the time wasted by using a search algorithm to select candidate regions, Ren et al. proposed the Faster R-CNN model [17] in 2016. The model combines candidate region selection, target box fine-tuning, classification and position regression, realizing fully end-to-end training. By using hand-designed anchor boxes, the number of candidate boxes is reduced by nearly nine tenths, improving both the quality of the candidates and the speed and accuracy of damage target detection. However, Faster R-CNN loses considerable detail after repeated down-sampling, so its detection of small targets is only mediocre.
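The hand-designed anchors can be sketched as follows (a minimal illustration; the 3 scales x 3 aspect ratios giving 9 anchors per location follow the usual Faster R-CNN setup, but the exact values are configuration choices):

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate the k = len(scales) * len(ratios) anchor boxes centred at
    one feature-map location, as (x1, y1, x2, y2). Width and height are
    chosen so that each anchor's area equals scale * scale while its
    width/height ratio equals the given aspect ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * (r ** 0.5)
            h = s / (r ** 0.5)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Sliding these 9 templates over every position of the shared feature map is what lets the region proposal network replace selective search.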

R-FCN
In 2016, to address the remaining problems of Faster R-CNN, Dai et al. proposed R-FCN [18]. In R-FCN, the fully connected layers used for classification and regression are replaced by convolutional layers, which greatly reduces the number of parameters; at the same time, Dai et al. constructed a set of position-sensitive score maps to overcome the translation-sensitivity problem in damage target detection. Experiments show that the network achieves 79.5% mAP on the VOC 2007 dataset, but it still requires a great deal of computation and its detection speed cannot meet real-time requirements.
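The size of the position-sensitive output can be illustrated with a small calculation (assuming the usual k x k = 3 x 3 spatial grid):

```python
def rfcn_score_maps(num_classes, k=3):
    """Number of position-sensitive score maps R-FCN produces: one k x k
    grid of spatial-position maps for each class plus one for background,
    i.e. k * k * (C + 1) maps in total."""
    return k * k * (num_classes + 1)
```

For the 20 PASCAL VOC classes this gives 9 * 21 = 189 maps, all produced by convolution, with no per-RoI fully connected computation.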

Mask R-CNN
In 2017, He et al. proposed Mask R-CNN [19] on the basis of Faster R-CNN, adding a mask prediction branch for instance segmentation. In addition, for feature extraction the architecture adopts a feature pyramid network (FPN) [20], and an RoIAlign pooling layer replaces the RoI pooling layer. The network improves detection accuracy and recognizes small objects more accurately, but detection speed is still not significantly improved.

TridentNet
In 2019, addressing the traditional multi-scale detection problem, Li et al. proposed the three-branch network TridentNet [21]. This work was the first to verify the influence of different receptive fields on detection results, concluding that large receptive fields detect large objects more accurately while small receptive fields detect small objects more accurately. TridentNet uses ResNet as its backbone; the first three stages are unchanged, and in the fourth stage the network is split into parallel branches. The three parallel branches use dilated (atrous) convolutions with different dilation rates, so that receptive fields from small to large correspond to the detection of small, medium and large targets. Compared with earlier algorithms, the network handles multi-scale detection better, and the three branches share weights, which effectively reduces the risk of overfitting. However, as the network structure grows, the amount of computation is still very large and real-time detection is not possible.
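The link between dilation rate and receptive field in the branches can be checked with a small calculation (stride-1 convolutions assumed for simplicity; the three dilation rates 1, 2, 3 are illustrative):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 convolutions: a layer with
    kernel size k and dilation d acts like a kernel of size d*(k-1) + 1,
    and stacked stride-1 layers grow the field additively."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += d * (k - 1)
    return rf
```

Three stacked 3 x 3 layers see 7, 13 or 19 input pixels for dilations 1, 2 and 3 respectively, so the parallel branches cover small, medium and large targets at no extra parameter cost when weights are shared.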

D2Det
In 2020, Cao et al. proposed a new two-stage detection algorithm, D2Det [22]. Its main advantage is solving accurate localization and accurate classification at the same time. The model introduces dense local regression, which predicts multiple dense box offsets for an object; unlike the box-offset regression and keypoint-based localization used in traditional two-stage detectors, it is not limited to a set of quantized keypoints in the region and can regress position-sensitive real-valued dense offsets, achieving more accurate localization. To reduce the influence of the background area on the target, a binary overlap prediction strategy is applied to improve the precision of the dense local regression; a discriminative RoI pooling layer is also introduced, which adaptively weights the samples extracted from each sub-region to obtain discriminative features, greatly improving performance. Table 1 lists the relevant evaluation metrics and their meanings, and Table 2 compares the performance of several candidate-region-based damage target detection algorithms, where "-" indicates that no data is available. As Table 2 shows, with continuous improvement the accuracy of damage target detection based on candidate region classification keeps rising, but detection speed does not improve correspondingly. To improve detection speed, regression-based detection algorithms have attracted wide attention.

DAMAGE TARGET DETECTION ALGORITHM BASED ON REGRESSION

YOLO series
In 2016, Redmon et al. proposed YOLOv1 [23]. The network overcomes the heavy feature-extraction cost of two-stage damage target detection: the whole image is divided into a grid, and the grid cell containing the centre of a target is made responsible for predicting it. The complex operation of generating candidate regions is thus avoided and detection speed improves significantly. However, because the detection area is the whole image, more background is included, which reduces detection accuracy; in particular, when the centres of several targets fall in the same grid cell, localization errors occur and only one target is detected.
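The grid-cell responsibility rule can be sketched as follows (a minimal illustration with the 7 x 7 grid of YOLOv1; the clamping only matters for centres on the image border):

```python
def responsible_cell(box_center, image_size, grid=7):
    """Return the (row, col) of the S x S grid cell that contains the
    box centre and is therefore responsible for predicting that object,
    as in YOLOv1 (S = 7 by default)."""
    x, y = box_center
    w, h = image_size
    col = min(int(x / w * grid), grid - 1)
    row = min(int(y / h * grid), grid - 1)
    return row, col
```

Two objects whose centres map to the same (row, col) collide, which is exactly the limitation noted above: the cell can only report one of them.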
In 2017, Redmon et al. proposed YOLOv2 [24] on the basis of YOLOv1, adopting batch normalization (BN) [25], which keeps the input of each layer following the same distribution and replaces the dropout method [26]; the model also uses higher-resolution images for training and adopts the anchor mechanism of Faster R-CNN for prediction.
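The batch normalization transform can be sketched for a single feature as follows (a minimal training-time illustration; real layers also track running statistics for use at inference time):

```python
def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise a batch of activations for one feature to zero mean and
    unit variance, then apply the learned scale (gamma) and shift (beta).
    eps guards against division by zero for constant batches."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return [gamma * (x - m) / (var + eps) ** 0.5 + beta for x in xs]
```

Whatever the raw activations were, the normalised batch has (approximately) zero mean, which is the "same distribution" property that stabilises training.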
In 2018, Redmon et al. proposed the improved YOLOv3 [27]. The model uses Darknet-53 as its backbone; compared with ResNet-101, Darknet-53 achieves comparable accuracy at much higher speed. The backbone consists of 53 convolutional layers, and a feature pyramid structure enables detection on feature maps at three different scales, effectively improving the network's detection of small targets.
In 2019, Choi et al. improved the YOLOv3 network and proposed Gaussian YOLOv3 [28], extending the dimensionality of the YOLOv3 bounding-box output so that it produces eight coordinate values (a mean and a variance for each of the four box coordinates), and optimizing the network loss function accordingly. On the KITTI dataset, the network's accuracy improves by 3% over the YOLOv3 model.
In 2020, Bochkovskiy et al. proposed YOLOv4 [29]. To enlarge the receptive field, an SPP module was added to the CSPDarknet53 backbone, and a variety of widely used techniques were combined, such as self-adversarial training (SAT), cross mini-batch normalization (CmBN), cross-stage partial connections (CSP) and DropBlock regularization. With these optimizations the model achieved state-of-the-art damage target detection accuracy at the time, striking an excellent balance between detection accuracy and speed.

SSD series
In 2016, Liu et al. proposed SSD [30]. Drawing on YOLOv1, the model detects targets on several feature layers with prediction boxes of different sizes: small-scale targets are predicted on shallow feature maps and large-scale targets on high-level feature maps, making better use of the different information emphases of different feature layers and improving the detection of small-scale targets. However, because the feature layers are fed to the predictor independently, the same target may be detected multiple times, and because shallow feature maps lack semantic information, SSD still falls short for small damaged targets.
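The way SSD assigns box sizes to feature layers can be illustrated with the linear scale rule (the constants s_min = 0.2 and s_max = 0.9 follow the common SSD setup; aspect-ratio handling is omitted here):

```python
def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    """Default-box scales (as fractions of the input size) for the m
    feature maps used by SSD: scales are spaced linearly between s_min
    and s_max, so shallow maps get small boxes for small objects and
    deep maps get large boxes for large objects."""
    return [round(s_min + (s_max - s_min) * (k - 1) / (m - 1), 2)
            for k in range(1, m + 1)]
```

For six feature maps this yields scales 0.2, 0.34, 0.48, 0.62, 0.76 and 0.9, matching the shallow-to-deep, small-to-large assignment described above.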
In 2017, Fu et al. proposed DSSD [31], whose biggest improvement is the introduction of context information into damage target detection. The backbone is replaced with ResNet-101 and residual modules are introduced, while deconvolution layers bring in context information. The model strengthens the expressive power of the shallow feature maps and improves the detection accuracy for small targets, but because of the additional layers its detection speed is lower than that of SSD.
To reduce the number of parameters in SSD, Shen et al. proposed DSOD [32], drawing on the idea of DenseNet. The model modifies the input of some feature layers and can be trained directly from scratch without a pre-trained model, while improving damage target detection.
In 2017, Li et al. proposed FSSD [33], which improves on FPN and builds a framework for feature fusion: feature maps of different levels are first combined, new feature map combinations are then generated, and the final prediction network produces the results. The model improves damaged-target detection accuracy while maintaining good detection speed.

RetinaNet
In 2017, Lin et al. compared one-stage and two-stage detectors and found that the class imbalance of training samples in one-stage detectors greatly reduces their accuracy, so they proposed RetinaNet [34]. The model reshapes the standard cross-entropy loss into a new loss function called focal loss, which re-weights the samples to be classified: the weight of easily classified samples is reduced and that of hard samples is increased, so that training concentrates on the hard samples and classification accuracy improves. The backbone is a ResNet-FPN whose main feature is that each level of the feature pyramid feeds two subnets, used for classification and for anchor-box regression respectively. Experiments show that RetinaNet retains the speed advantage of one-stage detectors while surpassing two-stage detectors in detection accuracy.
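The focal loss can be sketched for a single binary prediction as follows (gamma = 2 and alpha = 0.25 are the commonly reported defaults):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction: p is the predicted probability of
    the positive class, y the label (1 or 0). The (1 - p_t)^gamma factor
    shrinks the loss of well-classified examples, so abundant easy
    background samples no longer dominate training."""
    p_t = p if y == 1 else 1 - p
    a_t = alpha if y == 1 else 1 - alpha
    return -a_t * (1 - p_t) ** gamma * math.log(p_t)
```

A confidently correct prediction (p = 0.9 for a positive) contributes almost nothing, while a badly missed one (p = 0.1) keeps nearly its full cross-entropy weight.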

EfficientDet
To address the combination of fast detection speed but low detection accuracy in one-stage methods, Tan et al. proposed EfficientDet [35] in 2019. The model uses EfficientNet [36] as the backbone and a bidirectional feature pyramid network (BiFPN) as the feature network, which repeatedly performs weighted bidirectional feature fusion. At the same time, a compound scaling method is introduced that jointly and uniformly scales the depth, width and resolution of the backbone, the feature network and the box/class prediction network, so as to achieve the best trade-off. Combining these methods, EfficientDet needs only about a quarter of the parameters of the then-best model, while detection speed increases more than threefold.
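The compound scaling rule can be sketched as follows (the constants follow those reported for EfficientDet, but the paper additionally rounds widths to convenient values, so treat the exact numbers as illustrative):

```python
def efficientdet_config(phi):
    """Compound scaling for EfficientDet-D{phi}: a single coefficient phi
    jointly scales input resolution, BiFPN width/depth and the depth of
    the box/class prediction heads."""
    return {
        "resolution": 512 + 128 * phi,           # input image side length
        "bifpn_width": int(64 * (1.35 ** phi)),  # BiFPN channels
        "bifpn_depth": 3 + phi,                  # BiFPN layers
        "head_depth": 3 + phi // 3,              # box/class net layers
    }
```

Increasing phi grows every component together, which is what lets one knob trade accuracy against cost across the whole detector.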

CentripetalNet
In 2020, Dong et al. proposed CentripetalNet [37]. The model effectively solves the keypoint mismatching problem of keypoint-based detectors: it predicts the corner positions and a centripetal shift for each target, and matches corresponding corners accordingly, which is more accurate than the traditional embedding method. At the same time, a cross-star deformable convolution module is proposed, which improves the feature-adaptation ability of the network. In tests the model performs very well on the COCO dataset, with an AP surpassing most existing anchor-free detectors.
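The centripetal-shift matching idea can be sketched as follows (a conceptual illustration with hypothetical data structures, not the paper's implementation; `tol` is an assumed matching tolerance):

```python
def match_corners(tl_corners, br_corners, tol=2.0):
    """Match top-left and bottom-right corners via centripetal shifts.

    Each corner is ((x, y), (dx, dy)): a position plus a predicted shift
    pointing at the box centre. A pair is accepted when the two shifted
    points (approximate centres) agree within tol, so only corners that
    belong to the same box are joined."""
    pairs = []
    for tl, tl_shift in tl_corners:
        for br, br_shift in br_corners:
            c_tl = (tl[0] + tl_shift[0], tl[1] + tl_shift[1])
            c_br = (br[0] + br_shift[0], br[1] + br_shift[1])
            if abs(c_tl[0] - c_br[0]) <= tol and abs(c_tl[1] - c_br[1]) <= tol:
                pairs.append((tl, br))
    return pairs
```

A bottom-right corner whose shifted centre lands far from a top-left corner's shifted centre is rejected, which is how spurious cross-object pairings are filtered out.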

Summary
As Table 3 shows, the detection speed of regression-based damage target detection algorithms is significantly higher than that of the candidate-region-based algorithms introduced above, while their detection accuracy keeps improving and the gap with the latter keeps narrowing. Judging from current research, regression-based damage target detection has the broader development prospects.

Summary and prospect
In this paper, drawing on the state of development at home and abroad, we studied deep-learning-based damage target detection. We introduced the algorithms from two sides, candidate region classification and regression, and compared the advantages and disadvantages of each. We also analyzed the main characteristics of each model: algorithms based on candidate region classification achieve high detection accuracy, but because of their excessive parameters and dependence on candidate region generation they cannot meet real-time speed requirements; regression-based algorithms effectively solve the problem of low detection speed, simplify damage target detection into an end-to-end training process, and improve detection efficiency, and with continuous optimization their detection accuracy has grown higher and higher, especially for small damaged targets. Damage target detection has developed rapidly in recent years and achieved notable results, but several breakthroughs are still needed: (1) Current network models are complex, placing high demands on GPU performance and dataset size; to reduce or avoid network redundancy, the network structure should be optimized and more compact, portable models designed [38]. (2) When data are insufficient, the performance of a damage target detector generally declines and underfitting is likely, so the ability of detectors to learn from a small number of samples should be improved.
(3) To automate damage target detection and the design of the backbone network, automatic neural architecture search can be used to replace manual design. (4) To achieve intelligent detection of damage targets, new network models should be explored and designed so that the detector can intelligently learn new objects, locating and identifying object categories it has not seen before.