Improved Asian food object detection algorithm based on YOLOv5

An improved model called TR-YOLO is proposed for Asian food object detection. First, a ViT-based module is introduced into the model to make better use of global features. Second, Swin Transformer modules are introduced on the three detection branches that output features. Finally, the Mconcat feature fusion method is proposed, which lets the model learn feature weights and assign them to feature channels independently. Experimental results show that the TR-YOLO model further improves detection accuracy.


Introduction
Detecting food objects with machine learning is challenging. Because cooking methods and eating habits differ significantly across Asian countries, the colour, texture and other characteristics of Asian food are complex and varied, with diverse shapes and structures. These factors greatly increase the difficulty of Asian food object detection, so an effective method for detecting objects in Asian food images is urgently needed. In 2020, Shao Hongmin et al. [1] proposed a joint recognition method combining a convolutional neural network with the Hough transform: images in the VireoFood172 dataset are first detected preliminarily, and the Hough transform then extracts and labels individual foods according to the shape of the plate before categorizing them. Jiang et al. [2] proposed a Multi-Scale Multi-View Deep Feature Aggregation (MSMVFA) method for food recognition. In 2021, Bing Xu et al. [3] combined the CBAM attention module with the MobileNet-V2, VGG16 and ResNet50 networks for Asian food image classification. Mengmeng Zhang et al. [4] proposed an information fusion network named Interleaved Perceptual Convolutional Neural Network (IP-CNN) that integrates heterogeneous information; the fused food data are fed into a two-branch convolutional neural network for final classification.
Most of the above studies simply use a CNN to extract visual features for food recognition without considering the special characteristics of food itself, so their detection performance is not satisfactory. To address these problems in Asian food object detection and to further explore the correlation between the inherent characteristics of Asian food and the detection results, this paper improves the YOLOv5 algorithm and applies the improved algorithm to Asian food object detection. The backbone of the original YOLOv5 is composed of a series of C3 modules and convolutional layers. A Transformer layer can model the relationships among all pixels and use the attention mechanism to capture global context, extracting more powerful features that complement those of convolutional layers. Therefore, the C3TR module, an improvement based on the Transformer structure and the ViT structure [5], is introduced into the original YOLOv5; its structure is shown in Fig. 2. The C3TR module splits the input image into patches of a fixed size through a Patch Embedding operation, combines the patches into a sequence, and passes the result to a Transformer Encoder for feature extraction. The structure of the Transformer Encoder is shown in Fig. 3.
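As a concrete illustration of the Patch Embedding step described above, the following is a minimal PyTorch sketch. The class name, patch size and embedding dimension are illustrative assumptions; the paper does not publish its implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a feature map into patches and flatten them into a token sequence.

    A sketch of the Patch Embedding step only; patch size and embedding
    dimension are assumed values, not taken from the paper.
    """
    def __init__(self, in_channels=256, embed_dim=256, patch_size=2):
        super().__init__()
        # A strided convolution is the standard way to cut the map into
        # non-overlapping patches and project each patch to embed_dim.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)      # (B, N, D) token sequence
        return x
```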

Fig. 3. Structure of the Transformer Encoder (with Position Embedding).
Compared with the Transformer Encoder in the original ViT, the Transformer Encoder module used here reduces the number of Norm layers, which reduces the module's parameter count and saves computing resources during implementation. Since the Transformer structure cannot be learned effectively in shallow layers and its overhead is large, C3TR is placed in the deep and neck parts of the feature extraction network: the C3 modules at the 8th, 13th, 20th and 23rd layers are replaced with C3TR modules.
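The reduced-Norm encoder can be sketched as follows in PyTorch. The paper does not state which Norm layer is dropped, so this sketch assumes the pre-attention LayerNorm is removed and a single LayerNorm is kept before the MLP.

```python
import torch.nn as nn

class SlimEncoderBlock(nn.Module):
    """Transformer encoder block with a reduced number of Norm layers.

    A sketch under an assumption: the original ViT block has two
    LayerNorms; here only one is kept, before the MLP.
    """
    def __init__(self, dim=256, num_heads=8, mlp_ratio=4.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):                    # x: (B, N, D) token sequence
        attn_out, _ = self.attn(x, x, x)     # global multi-head self-attention
        x = x + attn_out                     # residual connection
        x = x + self.mlp(self.norm(x))       # MLP with the single Norm layer
        return x
```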
To avoid using the C3 module with a large channel count multiple times, to extract features more specifically, and to further enlarge the model's receptive field, this paper redesigns the neck network, replacing the C3 modules at the 17th, 20th and 23rd layers of the original YOLOv5 neck with the C3SwinTR module. This module uses the Swin Transformer [6] structure; the improved part is shown in the green box in Fig. 1.
A schematic of the Swin Transformer structure is shown in Fig. 4. The core of the C3SwinTR module consists of Windows Multi-head Self-Attention (W-MSA) and Shifted Windows Multi-head Self-Attention (SW-MSA). The two structures are used in pairs: W-MSA first, then SW-MSA. In the C3SwinTR module, the image is divided into blocks, flattened along the channel direction, and fed into W-MSA. W-MSA partitions the feature map into multiple disjoint windows, and the multi-head attention mechanism operates only within each window. In the next layer, the windows are shifted by half a window toward the lower right, creating connections between pixels that belonged to different windows in the previous layer.
Compared with ViT, which performs multi-head attention directly over global features, the W-MSA structure reduces computation but also cuts off information transfer between different windows. SW-MSA solves this lack of correlation between the non-overlapping windows of W-MSA. To avoid the extra computation caused by the increased number of windows while still relating the non-overlapping windows, this paper uses a cyclic shift method.
Before the cyclic shift is performed, the sub-windows are encoded; to ensure the correctness of the SW-MSA computation, attention is calculated only between parts that share the same encoding.
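The window partition of W-MSA and the cyclic shift used before SW-MSA can be sketched in a few lines of PyTorch. This illustrates the standard Swin Transformer operations rather than the paper's own code, and the attention-mask construction that restricts attention to same-coded regions is omitted.

```python
import torch

def window_partition(x, ws):
    """Split (B, H, W, C) into non-overlapping ws x ws windows for W-MSA."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    # -> (num_windows*B, ws*ws, C): each window becomes a token sequence
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def cyclic_shift(x, ws):
    """Roll the map by half a window so that pixels from different W-MSA
    windows end up in the same window for SW-MSA."""
    return torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))

# Illustrative usage (shapes only; window size 7 as in Swin Transformer):
feat = torch.randn(1, 28, 28, 96)      # (B, H, W, C) feature map
windows = window_partition(feat, 7)    # (16, 49, 96): 16 windows of 49 tokens
shifted = cyclic_shift(feat, 7)        # attention on this needs the mask so
                                       # that only same-coded parts interact
```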
We propose the Mconcat feature fusion method, whose structure is shown in Fig. 5. In addition, to match the channel counts between different layers, a fusion layer is added at layer 21 to fuse the features of layer 6 with those of layer 20. As shown in Fig. 5, the ratio of the input channel numbers of the two feature layers is initialized with weights ω_i. Through learning, the model obtains the normalized input-channel weights λ_i, as given in Eq. (1). Each weight is multiplied by its channel and the results are added to obtain a new channel C, as given in Eq. (2). Finally, a 1×1 convolution layer followed by a ReLU activation maps the result to a new feature layer.

\lambda_i = \omega_i / (\sum_j \omega_j + \varepsilon)   (1)

C = \lambda_1 C_1 + \lambda_2 C_2   (2)

Here ω_i is the initial weight of each of the two input channels, C_1 and C_2 are the channels of the two feature layers, and ε is a discount factor that keeps the denominator of the weight formula from being 0. The Mconcat feature fusion method alleviates the lack of focus on effective feature channels during feature extraction: by re-fitting and redistributing the feature information between channels, the model increases the proportion and weight of the effective feature channels, strengthening its training on effective feature information.
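To make Eqs. (1) and (2) concrete, the following is a minimal PyTorch sketch of the Mconcat fusion step. The module name, the assumption that both inputs share the same shape, and the relu() applied to the raw weights (to keep the denominator positive) are ours; the paper does not publish code.

```python
import torch
import torch.nn as nn

class Mconcat(nn.Module):
    """Learnable weighted fusion of two feature layers, per Eqs. (1)-(2).

    A sketch under assumptions: both inputs have shape (B, C, H, W), and
    raw weights are clamped non-negative, which the paper does not state.
    """
    def __init__(self, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))   # initial weights w_i
        self.eps = eps                         # discount factor epsilon
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x1, x2):
        w = self.w.relu()
        lam = w / (w.sum() + self.eps)         # Eq. (1): normalized weights
        c = lam[0] * x1 + lam[1] * x2          # Eq. (2): weighted sum
        return self.act(self.conv(c))          # 1x1 conv + ReLU mapping
```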

Analysis of experimental results
The experiments use the UEC-FOOD100 Asian food dataset, randomly divided into training, validation and test sets at a ratio of 8:1:1. Table 1 shows the performance of the C3TR module at each layer. The results in Table 1 show that replacing the C3 modules at the 8th, 13th, 20th and 23rd layers of the original YOLOv5 network with C3TR modules improves the model's performance; in particular, replacing the C3 module with C3TR at the 13th layer of the neck network greatly improves the detection accuracy.
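The 8:1:1 random split described above can be reproduced with a few lines of Python; this is a minimal sketch under our own conventions, since the paper does not publish its split script.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly split a list of samples 8:1:1 into train/val/test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)       # fixed seed for reproducibility
    n = len(samples)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (samples[:n_train],                 # training set
            samples[n_train:n_train + n_val],  # validation set
            samples[n_train + n_val:])         # test set
```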
To verify the impact of combining the modules on model performance, five models are studied on the same dataset; the performance comparison before and after improvement is shown in Table 2. The results of experiment 5 in Table 2 show that introducing the C3TR and C3SwinTR modules together with the Mconcat feature fusion method into the original YOLOv5 network increases the average precision by 5.75%, which further verifies the feasibility of the improved scheme. After training the proposed TR-YOLO and the original YOLOv5 under the same configuration for about 150 epochs, the average precision curves are compared in Fig. 6: Fig. 6(a) shows the curve at IoU=0.5 and Fig. 6(b) the curve at IoU=0.75. To further verify the effectiveness of the proposed algorithm, it is compared on the same Asian food dataset against seven representative or state-of-the-art object detection algorithms: SSD, Faster R-CNN, EfficientDet, YOLOv3, YOLOv4, the original YOLOv5 and YOLOX. The performance comparison is shown in Table 3.
Fig. 6. Average precision curves: (a) IoU=0.5; (b) IoU=0.75.

Table 1.
The performance of the C3TR module at each layer.

Table 2.
Comparison of experimental performance.

Table 3.
Performance comparison of mainstream algorithms.

Conclusion
In this paper, an improved TR-YOLO model based on YOLOv5 is proposed and applied to Asian food object detection. The performance of YOLOv5 with the ViT, Swin Transformer and Mconcat improvements is compared and analyzed. Experiments show that introducing the C3TR module effectively extracts global features and improves detection accuracy; introducing the Swin Transformer on the three detection branches helps the model extract effective features and suppress irrelevant ones; and the proposed Mconcat feature fusion method effectively increases the proportion and weight of the effective feature channels.