Improve SegNet with feature pyramid for road scene parsing

Road scene parsing is a common task in semantic segmentation. Road scene images are characterized by complex scene context, and targets of the same category can differ greatly in scale. To address these problems, we propose a semantic segmentation model combined with edge detection. We extend a segmentation network with an encoder-decoder structure by adding an edge feature pyramid module, yielding the Edge Feature Pyramid Network (EFPNet, for short). This module uses edge detection operators to obtain boundary information and then combines multiscale features to improve the ability to recognize small targets. EFPNet compensates for the shortcomings of convolutional neural network features and helps produce smooth segmentation. After extracting features from the encoder and decoder, EFPNet uses the Euclidean distance to compare the similarity between the representations of the encoder and the decoder, which improves the decoder's ability to restore information from the encoder. We evaluated the proposed method on the Cityscapes dataset. The experiments demonstrate that accuracy is improved by 7.5% and 6.2% over the popular SegNet and ENet, respectively, and an ablation experiment validates the effectiveness of our method.


Introduction
Road scene parsing is a fundamental but challenging task in computer vision. There are two main kinds of approaches: traditional segmentation methods [1,2] and segmentation methods based on deep learning. Traditional methods tend to produce good results in simple scenes. However, the context of images with many vehicles and pedestrians is usually complex and changeable, and objects of the same category can differ greatly in colour and appearance [3]. Therefore, these methods are rarely used alone in road scene parsing. With the continuous development of deep learning, convolutional neural networks have achieved great success in image feature extraction [4][5]. We therefore design an edge feature pyramid module to improve the performance of a semantic segmentation model. The main contributions of our work are as follows: (1) We design a novel edge feature pyramid module for edge feature extraction. This module helps merge handcrafted features from different stages. In this way, the ability to extract edge features is enhanced, so the model can obtain semantic boundary features easily. As a result, the model obtains features that are more conducive to road segmentation, thereby improving segmentation accuracy. (2) For the edge feature pyramid module, a new hybrid loss function is defined. Using it, our model improves the decoder's ability to restore high-level semantic features and further improves accuracy.

Related work
In 2015, Shelhamer et al. [6] proposed the Fully Convolutional Network (FCN), which removes the fully connected layers common in CNNs and laid the framework for solving semantic segmentation tasks. Badrinarayanan et al. [7] then proposed an encoder-decoder model called SegNet for image segmentation in 2016. Since then, many other methods have been proposed to enhance segmentation performance, such as ENet [8], UNet [9], the DeepLab family of algorithms [10][11][12][13][14], PSPNet [15], and so on.
The above methods are all based on deep CNNs. However, relying only on deep learning is not enough to handle changeable and complex tasks, nor can it deal with specific problems in a targeted manner [16]. Hence, there are examples of fusing various operators with convolutional features [17][18]. In fact, common edge detection operators [19][20] can help a neural network extract the semantic boundaries contained in the edges of objects. There are also edge detection methods based on deep learning [21].
Some semantic segmentation models already combine edge detection and deep learning. Marmanis et al. [22] combine edge detection with semantic segmentation by using the edge-detected image as input. In contrast, Huang et al. [23] combine the two processes to improve the overall segmentation effect. Song et al. [24] build a semantic boundary detection subnetwork via multitask learning. However, this approach can increase the difficulty of converging the overall network to a certain extent [25]. Considering the above methods, we improve the SegNet model with a newly designed pyramid module that obtains edge features from multiscale stages.

Proposed methods
We propose a semantic segmentation model combined with the Sobel edge detection operator; the whole model is shown in Figure 1. The model is mainly composed of two parts: the semantic segmentation network and the edge feature pyramid module. We use SegNet as the basic semantic segmentation model; its structure is shown at the bottom of Figure 1. SegNet is an autoencoder-style network composed of an encoder and a decoder. The encoder network in SegNet is topologically identical to the first 13 convolutional layers of VGG16 [26], and the decoder network is completely symmetrical to the encoder.

Edge feature pyramid
The feature maps obtained at each stage in SegNet are subjected to edge detection, and the edge feature extraction process is the same in each stage. In this module, a 1×1 convolutional layer is applied after each convolutional layer of SegNet for dimension reduction. The number of output channels is 19, which is the number of road-scene categories. Because the feature maps produced by different convolutional layers within the same stage have the same size, these features can be fused by simple element-wise addition. In this way, we obtain the edge feature map of that stage.
To capture the edge representation ability of the entire encoder, we perform a multiscale fusion of the feature maps obtained at each stage. A feature pyramid module is introduced here to fuse the feature maps of the stages; its architecture is shown in Figure 2. Since up-sampling loses some information and impairs the complementarity between features, directly adding the feature maps is not suitable. Instead, we use 1×1 convolutions to fuse features at multiple scales. This operation preserves the original information and produces appropriate features for the subsequent steps.
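As a concrete illustration, the per-stage reduction and fusion described above can be sketched in a few lines of NumPy (the actual model uses PyTorch convolution layers; the channel count 64, the random weights, and the function name here are illustrative assumptions). A 1×1 convolution is simply a linear map over channels at every pixel, so two same-size maps reduced to 19 channels can be fused by element-wise addition:

```python
import numpy as np

def conv1x1(x, w):
    """Apply a 1x1 convolution: a linear map over channels at each pixel.
    x: (C_in, H, W) feature map, w: (C_out, C_in) weight matrix."""
    return np.einsum("oc,chw->ohw", w, x)

rng = np.random.default_rng(0)
num_classes = 19  # road-scene categories on Cityscapes

# Two feature maps from the same SegNet stage: identical spatial size,
# produced by successive conv layers (64 channels is illustrative).
f1 = rng.standard_normal((64, 32, 32))
f2 = rng.standard_normal((64, 32, 32))

# Reduce each map to 19 channels with its own 1x1 convolution ...
w1 = rng.standard_normal((num_classes, 64)) * 0.01
w2 = rng.standard_normal((num_classes, 64)) * 0.01
r1, r2 = conv1x1(f1, w1), conv1x1(f2, w2)

# ... then fuse them by simple element-wise addition, since the sizes match.
stage_edge_map = r1 + r2
print(stage_edge_map.shape)  # (19, 32, 32)
```

Because the reduction happens before the addition, each stage contributes a 19-channel edge map regardless of how many channels its convolutional layers produce.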

Loss function
To measure the gap between the edge feature representation capabilities of the encoder and the decoder, the model adopts a hybrid supervision loss, defined in Equation 1. The first term is the Focal loss [27], and the second is the Mean Squared Error (MSE) loss. Focal loss is a loss function proposed for category imbalance. It improves the cross-entropy (CE) loss by adding a modulating factor before the CE term, as defined in Equation 3. Through this modulating factor, the learning process focuses more on hard negative examples.
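To make the modulating factor concrete, here is a minimal NumPy sketch of the binary form of the Focal loss (the full model applies a multi-class variant; the default gamma and the function name are illustrative assumptions):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: cross entropy scaled by a (1 - p_t)^gamma factor.
    p is the predicted probability of the positive class, y is a 0/1 label.
    With gamma = 0 this reduces to ordinary cross entropy."""
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)  # modulating factor before CE

# Well-classified examples are down-weighted far more than hard ones,
# so the gradient concentrates on the hard examples.
easy = focal_loss(np.array(0.9), np.array(1))  # confident and correct
hard = focal_loss(np.array(0.1), np.array(1))  # confident and wrong
```

With gamma = 2, the easy example above is penalized roughly a hundred times less than plain cross entropy would penalize it, while the hard example keeps most of its loss.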
For a model with an encoder-decoder structure, we believe the final output of the decoder should be as similar to the input as possible [28]. We therefore measure the feature extraction capability of the encoder and decoder by computing the similarity of their edge feature maps, which pushes the representation of the decoder towards that of the encoder. Comparing the two representations is a regression task, so we use the MSE loss. The MSE is given in Equation 4, where Y denotes the output, and the subscripts D and E denote the decoder and encoder, respectively.
We also add a dynamic coefficient alpha before the MSE term, computed as the ratio of the current training epoch to the total number of epochs (Equation 5). Alpha balances the proportions of the two loss terms during backpropagation. Since the decoder has not yet learned the representation of the dataset at the beginning of training, its expressive ability is limited. As training proceeds, the decoder gradually learns the segmentation features, and the coefficient increases the weight of the MSE term, which stands for the edge difference, in the overall loss, so that it correctly supervises the training of the model.
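A minimal sketch of this schedule, under the assumption that the hybrid loss has the form Focal + alpha·MSE with alpha = epoch / total_epochs (function names are our own):

```python
import numpy as np

def mse(y_decoder, y_encoder):
    """Mean squared error between decoder and encoder edge feature maps."""
    return np.mean((y_decoder - y_encoder) ** 2)

def hybrid_loss(focal_term, y_decoder, y_encoder, epoch, total_epochs):
    """Focal term plus an MSE term whose weight alpha grows linearly over
    training, so the untrained decoder's edge maps barely matter at first."""
    alpha = epoch / total_epochs
    return focal_term + alpha * mse(y_decoder, y_encoder)

y_e = np.ones((19, 8, 8))   # encoder edge features (toy values)
y_d = np.zeros((19, 8, 8))  # decoder edge features (toy values)
early = hybrid_loss(0.5, y_d, y_e, epoch=0, total_epochs=700)    # alpha = 0
late = hybrid_loss(0.5, y_d, y_e, epoch=700, total_epochs=700)   # alpha = 1
```

At epoch 0 the MSE term is switched off entirely, and by the final epoch it contributes with full weight, matching the schedule described above.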
In summary, when the two loss functions are combined, the Focal loss provides the gradient for semantic segmentation, while the MSE helps the decoder acquire a better edge representation and also provides a gradient for the entire model, thereby improving semantic segmentation accuracy.

Implementation details
The algorithm in this paper is implemented in the deep learning framework PyTorch and trained on two 16 GB NVIDIA Tesla P100 GPUs. The evaluation metric is mean Intersection over Union (mIoU), computed on the Cityscapes dataset [29]. We train all variants with stochastic gradient descent (SGD) using an initial learning rate of 0.001 and a momentum of 0.9. The batch size is 24, the weight decay is 5×10^-4, and the maximum number of epochs is 700. For faster training, we load the weights of the first 13 convolutional layers of VGG16 into the model; the remaining parameters are initialized with Kaiming initialization [30].
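As a reference for the metric, a compact NumPy sketch of mIoU computed from a confusion matrix (classes absent from both prediction and label are skipped; the function name is our own):

```python
import numpy as np

def mean_iou(pred, target, num_classes=19):
    """mIoU from an accumulated confusion matrix.
    pred, target: integer label maps of the same shape."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (target.ravel(), pred.ravel()), 1)  # rows: truth, cols: pred
    inter = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm)
    valid = union > 0  # ignore classes absent from both maps
    return float(np.mean(inter[valid] / union[valid]))
```

In practice the confusion matrix is accumulated over the whole validation set before the per-class IoUs are averaged, rather than averaging per-image scores.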

Experiments results
We compare the segmentation results obtained by our method with those obtained by SegNet. Figure 3 shows the results of the two methods alongside the ground-truth labels of the dataset.
Scene 1 shows that shadows make it difficult to distinguish the sidewalk from the road; our method largely recognizes the road turning right, which SegNet cannot. In scene 2, SegNet misclassifies the small truck target, while EFPNet remains sensitive to small objects. In the segmentation result of scene 3, our method identifies the light pole, which is discontinuous in the SegNet result. The detailed road information in scene 4 is misclassified by SegNet, but our model recognizes it accurately. In addition, we compare EFPNet with several existing methods on specific categories. Table 1 compares the accuracy of several methods over the 19 categories of the Cityscapes dataset. It shows that our method improves on most categories, and that categories sensitive to edge features improve the most, which further proves the effectiveness of the proposed method.

Ablation experiment
To verify the necessity and effectiveness of the edge detection operator in our method, we conducted some ablation experiments on the Cityscapes dataset.
We remove the Sobel operator from the edge detection part of the encoder and decoder and check the performance of the resulting model. The accuracy of the model with Sobel is 65.3% and that of the model without Sobel is 63.2%, which proves that the Sobel operator is meaningful in the overall model. The edge detection operator is a handcrafted feature extractor that extracts only the edge features of the feature map, which reduces the complexity of the features. If the raw, undetected feature maps are used directly for similarity comparison, the model is difficult to converge because of their high complexity.
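For reference, the Sobel operator is just a fixed pair of 3×3 first-order derivative kernels. A minimal NumPy sketch using valid-padding cross-correlation on a single channel (the in-network version applies the kernels per feature channel; the function name is our own):

```python
import numpy as np

# Standard 3x3 Sobel kernels (first-order derivative approximations).
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def sobel_magnitude(img):
    """Gradient magnitude of a single-channel image (valid cross-correlation)."""
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * SOBEL_X)
            gy[i, j] = np.sum(patch * SOBEL_Y)
    return np.hypot(gx, gy)

# A vertical step edge: the response concentrates at the boundary columns.
img = np.zeros((6, 6))
img[:, 3:] = 1.0
edges = sobel_magnitude(img)
```

On this step image the flat regions respond with zero while the columns touching the boundary respond strongly, which is exactly the behaviour that turns a feature map into a low-complexity edge map.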
We place four different edge detection operators into the edge detection pyramid module and compare their effects; the results are shown in Table 2. We find that the Sobel and Prewitt operators perform similarly, while the second-order operators perform worse. This is mainly because the two second-order operators are mainly used to filter out noise; their edge detection capability is not as good as that of the two first-order operators. In addition, we compare the segmentation performance when different stages are treated as the basic stage in the pyramid module. Table 3 shows that there is a certain gap between the results obtained by selecting different stages as the basic scale. Among them, stage 2 and stage 4 perform best and can be used as basic scales. Figure 4 shows the edge feature maps and corresponding heat maps obtained after different stages. Because stage 1 has a larger scale and less semantic information, its low-level information is more complete, but its boundary information is not obvious enough. For stage 2 and stage 4, the heat maps show that the boundaries are more obvious.

Conclusions
Modern traffic road scenes are so complex that the scales of objects in the same category vary greatly, and common segmentation methods often suffer from discontinuous edges and difficulty identifying small objects. To solve these problems, we design an edge pyramid module that helps the model obtain semantic boundary features derived from a feature pyramid and other methods. With this module, our model is more sensitive to boundary features and extracts semantic edges more effectively. Finally, we validate the effectiveness of our method on the Cityscapes dataset. The results show that the method effectively alleviates edge blur in semantic segmentation and greatly improves segmentation accuracy.