Food Image Detection System and Calorie Content Estimation Using Yolo to Control Calorie Intake in the Body

Excess calories in the body can cause obesity and several degenerative diseases, such as diabetes mellitus, heart disease, stroke, and hypertension. This system helps people keep their calorie intake balanced. The research designs the system using the YOLO algorithm to detect the type of food, with additional code written in the Python programming language to estimate the calories of the detected food. YOLO extracts features from images, which are processed through filters as arrays, to perform detection. The system estimates food calories by multiplying the calories per portion of each food type by the number of detected objects of that type. The calorie values per portion are taken from FatSecret Indonesia. In model testing, food detection performance is quite good, with average precision, recall, and F1-score values of 0.94, 0.90, and 0.91, respectively. However, when tested on Hugging Face, the average precision, recall, and F1-score values dropped to 0.84, 0.32, and 0.41, respectively. This decrease is attributed to limited CPU resources and to a loss of image quality when images are uploaded to Hugging Face.


Introduction
Excess calories in the body are a critical condition because they can lead to obesity and several degenerative diseases, such as diabetes mellitus, heart disease, stroke, and hypertension. This problem is increasing rapidly in developing countries such as Indonesia. According to data from Riskesdas (Basic Health Research), the proportion of adults with excess calories rose from 10.5% in 2007 to 14.8% in 2013 and 21.8% in 2018 [1].
Excess calories can be prevented by keeping a healthy diet, limiting calorie intake to what the body needs, and exercising frequently. A healthy diet consists of foods that are low in energy, provide enough vitamins and minerals, and contain plenty of fiber; it is also advisable to choose foods that are low in fat and carbohydrates [2].
The Regulation of the Minister of Health of the Republic of Indonesia Number 28 of 2019, concerning the recommended nutritional adequacy rate for Indonesian people, sets out the government's vision of a healthy Indonesian society. The government encourages Indonesians to live healthily by regulating the nutritional intake the body needs. This research is in line with that vision and aims to help realize it [3].
Given these problems, this paper describes the design of a system that quickly calculates estimated calories, namely "Food Image Detection System and Estimation of Caloric Content Using Deep Learning to Regulate Calorie Intake in the Body". In summary, the system uses machine learning to detect food and estimate its calorie content.

Calories
Calories are a measure of the energy in food that humans need to move and to survive. However, excessive calorie intake can be fatal; diabetes and obesity are examples that often occur in today's society. The average man needs 1,600 to 2,650 calories and the average woman needs 1,400 to 2,250 calories [3]. Caloric needs depend on daily activities, which determine how much energy a person uses, as well as on height, weight, and other factors such as age and gender. Calorie is abbreviated as "cal" [4]. The major cause of obesity is the body's inability to balance the number of calories absorbed with the number of calories expended. This causes a calorie imbalance in the human body, either a deficit or an excess [4].

ICIMECE 2023
https://doi.org/10.1051/e3sconf/202346502057
E3S Web of Conferences 465, 02057 (2023)

YOLO
YOLO is an acronym for "you only look once". It is an algorithm developed from CNNs and one of the deep learning-based object detection models. YOLO was originally developed by Joseph Redmon; later, Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao contributed to its further development [5].
The YOLO algorithm has advantages such as accurate real-time object detection, easier installation than other models, and being open source for customization and commercial use. Figure 1 illustrates the YOLO structural model [6]. The YOLO architecture is a single-stage object detection method, whose idea is to look at the image only once. This single pass consists of four blocks, namely input, backbone, neck, and dense prediction, which are described below [7].
Input. At this stage, the system resizes the image to the resolution of the input layer. The input resolution can be changed as long as the pixel dimensions are divisible by 32. The accepted input is a color image, that is, an image with 3 channels.
Backbone. The backbone is the core of the algorithm; it is the architecture that performs feature extraction. A backbone commonly used in the YOLO algorithm is the cross stage partial (CSP) connection. CSP divides the input feature map into two parts: one part passes through the dense block, while the other bypasses it, and the two parts are merged afterwards.
Neck. This stage provides intermediate layers between the backbone and the dense prediction. These layers allow the model to detect objects of various sizes.
Dense prediction. This stage determines the bounding boxes and classifies the object in each bounding box into one of the specified classes. It works by dividing the input image into grid cells, where each cell has anchor boxes that make predictions on the image. The predictions from different anchor boxes can overlap. NMS (non-max suppression) solves this problem. NMS calculates the IoU (intersection over union) between overlapping predicted boxes, keeps the box with the highest confidence score, and discards the other boxes whose IoU with it exceeds a threshold.
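The suppression step in the dense prediction stage can be illustrated with a minimal sketch. This is a simplified, class-agnostic version written for illustration, not the implementation used inside YOLO; the function names `iou` and `nms`, the corner box format, and the threshold of 0.5 are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner format, assumed


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Return the indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical predictions: two heavily overlapping boxes and one separate box.
boxes = [(10, 10, 110, 110), (15, 15, 115, 115), (200, 200, 300, 300)]
scores = [0.90, 0.75, 0.60]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap anything and is kept.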

Mean Average Precision (mAP)
Mean average precision is a metric used to evaluate an object detection model. Expressed as a percentage, the mAP value ranges from 0 to 100. The higher the mAP, the better the model. To calculate mAP, we compute the average precision separately for each class and then average these values over all classes [7].

Intersection over Union (IoU)
Intersection over union is the ratio of the overlapping area to the combined area of two bounding boxes: the box predicted by the detector and the ground truth bounding box. A commonly used threshold value is 0.5; hence, researchers commonly denote the corresponding accuracy measure as mAP 0.5 [8].
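As a worked illustration of this definition, the ratio can be written out for a predicted box and a ground truth box. The symbols $B_p$ and $B_{gt}$ and the box sizes below are assumptions chosen only for the example.

```latex
\[
\mathrm{IoU}(B_p, B_{gt}) = \frac{\operatorname{area}(B_p \cap B_{gt})}{\operatorname{area}(B_p \cup B_{gt})}
\]
% Example: two 100 x 100 px boxes, the second shifted 50 px to the right.
\[
\mathrm{IoU} = \frac{50 \times 100}{100 \times 100 + 100 \times 100 - 50 \times 100}
             = \frac{5000}{15000} \approx 0.33
\]
```

With a threshold of 0.5, this pair of boxes would not be counted as a match.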

Confusion Matrix
This paper uses the confusion matrix to measure the performance of the detection model. The confusion matrix is usually applied to binary and multi-class classification. Its terms are explained as follows.
True Positive (TP). TP is the number of positive samples that are correctly predicted. When using mAP 0.5, a prediction is counted as TP if the predicted class matches and the IoU value is ≥ 0.5.
True Negative (TN). TN is the number of correctly predicted negative samples. TN is not used in object detection because it is not relevant there.
False Positive (FP). FP is the number of samples whose actual value is negative but that are detected as positive. When using mAP 0.5, a prediction is counted as FP if the predicted class matches but the IoU value is < 0.5.
False Negative (FN). FN is the number of samples whose actual value is positive but that are detected as negative. When using mAP 0.5, an object is counted as FN if nothing is detected for it, or if the detection is appropriate but the IoU value is < 0.5.

Precision
Precision measures how accurate the object detection results are. It is obtained by dividing the number of true positives by the sum of the true positives and false positives, as expressed in the following equation [9]:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall
Recall measures how many of the ground truth objects the system can detect. It is obtained by dividing the number of true positives by the sum of the true positives and false negatives, that is, by the number of ground truth objects, as expressed in the following equation [9]:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1-score
F1-score is the harmonic mean of precision and recall. It has a maximum value of 1.0 and a minimum value of 0. A model is good in terms of precision and recall if its F1-score is 1 or close to 1, and poor if its F1-score is 0 or close to 0. The F1-score is calculated with the following equation [9]:

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
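The three metrics can be computed directly from the TP, FP, and FN counts of a class. The following is a minimal sketch; the function name `detection_metrics` and the example counts are hypothetical and are not taken from the paper's tables.

```python
from typing import Tuple


def detection_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from confusion-matrix counts.

    A ratio with a zero denominator is reported as 0.0 (a convention chosen here).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one class: 9 correct detections, 1 false alarm, 3 misses.
precision, recall, f1 = detection_metrics(tp=9, fp=1, fn=3)
```

Averaging the per-class values over all classes gives average figures of the kind reported in the testing sections.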

Python
Python is a versatile high-level programming language that focuses on very clear code syntax. It was first released in 1991 by a programmer from the Netherlands and has been developed continuously since then. Python has grown rapidly, adding new syntax and accumulating a very large and comprehensive set of libraries. Python scripts are also considered easier to write and understand than those of many other programming languages [10].

Research Methods
In this study, several methods were used to collect the data on which the study is based. The research method diagram is presented in Figure 2.

Calorie Data Retrieval
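Table 2 shows the food list. The calorie data for the food images in this study assume Indonesian food portions and are taken from FatSecret Indonesia (https://www.fatsecret.co.id/kalori-gizi/). The calorie data per serving for all food types used in this study are shown in Table 3.

Image Data Retrieval
Image data is collected manually by downloading images from the internet. A total of 560 images were collected: 50 images for each of the classes apple, fried chicken, orange, white rice, cake, banana, fried tofu, omelet, boiled egg, and fried tempeh, and 20 images for each of the classes ½ apple, ½ orange, and ½ boiled egg.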

Image Annotation
At this stage, each object in the image is annotated with a class label and an appropriate bounding box. The annotation process produces a text file with the "txt" extension. The file contains information about each annotated object, namely the class name, the x and y center coordinates of the bounding box, and the width and height of the bounding box.
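A minimal sketch of reading one such annotation line is given below. It assumes the common YOLO text layout, in which each line holds a numeric class index followed by the box center, width, and height, all normalized to the range 0 to 1; the names `Annotation`, `parse_line`, and `to_pixel_box` and the sample line are hypothetical.

```python
from typing import NamedTuple, Tuple


class Annotation(NamedTuple):
    class_id: int      # index into the class list (assumed YOLO convention)
    x_center: float    # normalized to [0, 1]
    y_center: float
    width: float
    height: float


def parse_line(line: str) -> Annotation:
    """Parse one 'class x_center y_center width height' line."""
    parts = line.split()
    return Annotation(int(parts[0]), *(float(p) for p in parts[1:5]))


def to_pixel_box(ann: Annotation, img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert a normalized center box to pixel corners (x1, y1, x2, y2)."""
    x1 = (ann.x_center - ann.width / 2) * img_w
    y1 = (ann.y_center - ann.height / 2) * img_h
    x2 = (ann.x_center + ann.width / 2) * img_w
    y2 = (ann.y_center + ann.height / 2) * img_h
    return x1, y1, x2, y2


# Hypothetical annotation line for a 640x640 image.
ann = parse_line("3 0.5 0.5 0.25 0.5")
box = to_pixel_box(ann, 640, 640)
```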

Dataset Augmentation
The image augmentation stage aims to enlarge the dataset and make it more varied, using the tools in Roboflow.
This stage uses seven augmentation methods, namely flip, saturation, exposure, blur, noise, cutout, and mosaic.
Flip augmentation is done in two ways, horizontally and vertically. A horizontal flip mirrors the image from left to right, while a vertical flip mirrors it from top to bottom.
Saturation augmentation changes the saturation value of the image, which means reducing or increasing the color intensity. In this study, the image saturation was changed by -25% and +25%.
Exposure augmentation changes the exposure value of the image, which means reducing or increasing the light intensity. In this study, the image exposure was changed by -25% and +25%.
Blur augmentation simulates the varying levels of image sharpness that may occur when pictures are taken during testing. In this research, the blur value was set to 2.5 px.
Noise augmentation adds unwanted color dots to an image. Like blur, it is applied to cover the diverse conditions that may occur during testing. In this research, noise was added to 5% of the pixels.
Cutout or grid mask augmentation adds several black squares to an image. This forces the system to learn from images in which the object is not fully visible. In this study, 5 boxes with a size of 15% each were cut out.
Mosaic augmentation adds variation to the dataset by combining several images into one. The aim is to prepare the model for test images that contain many objects of different classes in one frame.
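The study applies these augmentations with Roboflow's tools. To make two of them concrete, the following is a small illustrative sketch of flip and cutout on a plain pixel grid. It is not Roboflow's implementation; the `Image` type, the function names, and the interpretation of "15%" as 15% of each image side are assumptions.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale pixel grid, a stand-in for a real image


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def cutout(img: Image, boxes: int, size_percent: int, rng: random.Random) -> Image:
    """Black out `boxes` rectangles, each `size_percent` of the image side."""
    h, w = len(img), len(img[0])
    bh = max(1, h * size_percent // 100)
    bw = max(1, w * size_percent // 100)
    out = [row[:] for row in img]
    for _ in range(boxes):
        top = rng.randrange(h - bh + 1)
        left = rng.randrange(w - bw + 1)
        for y in range(top, top + bh):
            for x in range(left, left + bw):
                out[y][x] = 0  # black pixel
    return out


img = [[1, 2, 3], [4, 5, 6]]
flipped_h = flip_horizontal(img)
flipped_v = flip_vertical(img)

# 5 boxes of 15% each on a 20x20 white image, as in the study's cutout setting.
white = [[255] * 20 for _ in range(20)]
masked = cutout(white, boxes=5, size_percent=15, rng=random.Random(0))
```

Note that when an image is flipped, its bounding box annotations must be transformed in the same way; tools such as Roboflow handle this automatically.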

Dataset Split
The next step is to connect Google Colaboratory with Google Drive, so that the project on Google Colaboratory is synchronized with the Google Drive storage of the account used. This gives access to the data that was previously uploaded to Google Drive and will later be split and used for training, such as the image data and the file with the "yaml" extension.
Before training the YOLO v8 model, the dataset was split into training and validation data. The split assigns 80% of the data to training and 20% to validation, resulting in 1344 training images and 336 validation images out of 1680 images. The split data is then placed in the data folder.
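An 80/20 split of this kind can be sketched as follows. The paper does not state how the images were assigned to each subset, so the random shuffle, the fixed seed, and the file names below are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence[str], train_percent: int = 80,
                  seed: int = 42) -> Tuple[List[str], List[str]]:
    """Shuffle the items reproducibly and split them into train and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100  # integer arithmetic avoids rounding issues
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 1680 images.
images = [f"img_{i:04d}.jpg" for i in range(1680)]
train_images, val_images = split_dataset(images)
```

With 1680 images, this yields 1344 training images and 336 validation images, matching the counts reported above.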

Dataset Training
The next stage is to train the model on the image data using Google Colaboratory. Training uses the YOLO v8 model, starting from the file "yolov8m.pt". The model is trained on the image data so that it learns to recognize the specified objects. Training in this study uses epochs = 100, batch = 8, and an image size of 640x640, as shown in Table 4.5. Training from "yolov8m.pt" produces the file "best.pt", which is the training result file of the YOLO v8 model.
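A training run with these settings might look like the sketch below, which assumes the `ultralytics` Python package is used. The paper does not show its training code, and the dataset file name `data.yaml` is a placeholder.

```python
# Hyperparameters reported in the text; the dataset path is a placeholder.
TRAIN_ARGS = {
    "data": "data.yaml",  # hypothetical path to the dataset description file
    "epochs": 100,
    "batch": 8,
    "imgsz": 640,
}


def train(weights: str = "yolov8m.pt"):
    """Train starting from the given checkpoint; requires the `ultralytics` package."""
    from ultralytics import YOLO  # imported lazily so this file loads without it

    model = YOLO(weights)
    return model.train(**TRAIN_ARGS)
```

Calling `train()` in a Colaboratory notebook starts the run. With default settings, the package writes the best checkpoint as `best.pt` inside its run output directory.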

Calorie Estimation Calculation
The estimated calories of the detected food are calculated from a calorie value that is set for each food type in the program code. The total estimate is obtained by multiplying the number of calories of each food type by the number of objects detected as that type. For example, Figure 3 contains white rice with 204 cal, fried chicken with 239 cal, and fried tempeh with 34 cal, so the total calories detected are the sum of these values, which is 477 cal.
If food objects are stacked, the detection system cannot classify them properly. However, if the food objects in the image are only slightly stacked or slightly covered by other objects, the system can still detect them properly. This is influenced by the size of the training dataset. The augmentation stage also helps here, especially cutout augmentation, which removes part of the food image so that the system learns to detect partially visible objects.
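The calorie calculation described in this section can be sketched as a lookup table and a sum. The three calorie values come from the example in the text; the dictionary name, the function name, and the label strings are assumptions, and the real system would receive its labels from the detector.

```python
from collections import Counter
from typing import Dict, Iterable

# Calories per serving for three of the classes, as given in the text.
CALORIES_PER_SERVING: Dict[str, int] = {
    "white rice": 204,
    "fried chicken": 239,
    "fried tempeh": 34,
}


def estimate_calories(detected_labels: Iterable[str]) -> int:
    """Sum calories per serving multiplied by the number of detections of each class."""
    counts = Counter(detected_labels)
    return sum(CALORIES_PER_SERVING[label] * n for label, n in counts.items())


# One detection of each class, as in the Figure 3 example.
total = estimate_calories(["white rice", "fried chicken", "fried tempeh"])

# Two pieces of fried tempeh are counted twice.
total_two_tempeh = estimate_calories(["white rice", "fried tempeh", "fried tempeh"])
```

Because the estimate depends only on the detected labels, a missed detection directly lowers the total.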

Model Performance Testing
Model performance testing evaluates how accurately the trained model detects objects. The test used 100 sample images that differ from the training data and was carried out in the PyCharm IDE. An example of a model testing result is shown in Figure 3.
The results of model testing are presented in Table 4 as TP, FP, and FN values. The confusion matrix shows that some detections are still incorrect. For example, the omelet class still has many wrong detections: in some cases the actual object is an omelet but the system detects another class or detects nothing, and in other cases the actual object is not an omelet but the system detects an omelet. These are referred to as false negatives (FN) and false positives (FP), respectively. However, several classes are detected quite well, for example apple, although it still has one FP because an orange object was detected as an apple.

Table 4. Confusion Matrix Model Testing
The TP, FP, and FN values can be used to determine the precision, recall, and F1-score values, as shown in Table 5. These values are obtained by applying the precision, recall, and F1-score equations to each class.
Table 5 shows that, on average, the classes achieve fairly good performance, with average precision, recall, and F1-score values of 0.94 (94%), 0.90 (90%), and 0.91 (91%), respectively. The highest F1-score, 1 (100%), was obtained by the apple and half apple classes, while the lowest, 0.77 (77%), was obtained by the omelet class. This is likely because many images in the apple dataset contain several annotated apple objects, which provides more data for training, whereas most images in the omelet dataset contain only one omelet object and therefore only one annotation, which leaves less data for training. Another possible reason is that the apple test images are not very diverse, so the model detects the objects easily, while the omelet test images cover too many conditions, so the test results are not as good.

Website Performance Testing
Website testing evaluates the object detection performance of the website created on the Hugging Face platform. The website is tested with data that differs from the training data. To compare the website with the model, the test used the same 100 images as the previous model test. The images are input one by one, and the results are analyzed by recording the TP, FP, and FN values. The image output also shows the total calorie value of the objects detected in the image. As an example, Figure 4 contains white rice (204 cal), fried tempeh (34 cal), and fried chicken. However, in testing Figure 10, the fried chicken object was not detected by the system, so the total calories detected were 238 cal.
After testing on the website, the TP, FP, and FN values were obtained from all the test image results. The test results are displayed in the confusion matrix in Table 6. The confusion matrix differs from the one obtained during model testing, even though both tests use the same test images. In the web test, many objects are not detected and many predictions are wrong. Table 7, which lists the precision, recall, and F1-score values, shows that the performance of this implementation is quite poor, with average precision, recall, and F1-score values of 0.84 (84%), 0.32 (32%), and 0.41 (41%), respectively. A system can be considered quite good if its F1-score is close to 1 or at least greater than 0.80. In this test, the best F1-score was obtained by boiled eggs with a value of 0.88 (88%), while the worst was obtained by fried tofu with a value of 0.06 (6%). Several factors might explain the poor test results. The most important factor is the difference between GPU and CPU processing during object detection. The trained model was tested in the PyCharm IDE on a laptop GPU (NVIDIA GeForce RTX 3050), while the model on the Hugging Face website was tested on the default CPU with 2 vCPUs and 16 GB of RAM. The second factor is that the training result file best.pt may degrade when uploaded to Hugging Face, so that it no longer performs like the original. Another factor is the use of the free, non-premium tier of Hugging Face, which may restrict the systems or functions available.

Conclusion
The design of a system for detecting food and estimating its calorie content consists of dataset collection, dataset annotation, dataset augmentation, model training, and model implementation. The dataset consists of image data in "JPG" format. Dataset annotation assigns bounding boxes to the objects in each image. Dataset augmentation increases the variety of the data. Training teaches the model to detect the objects. Model implementation applies the trained model so that it can detect food and estimate its calories.
The trained object detection model performs well enough to detect food objects. Detection performance is assessed through the precision, recall, and F1-score values of the model test results. In the test results, the model has an average precision, recall, and F1-score of 0.94, 0.90, and 0.91, respectively. This means that object detection with the trained model has a performance of 91% in terms of F1-score. The higher the F1-score, or the closer it is to 1.0, the better the detection performance, which is consistent with the theory.
The system implemented as a website on Hugging Face performs quite poorly. This is shown by the precision, recall, and F1-score values calculated from the implementation test results. The values obtained in the Hugging Face test have relatively low averages of 0.84, 0.32, and 0.41, respectively. This means that object detection in the implemented model has a performance of 41% in terms of F1-score. The lower the F1-score, or the closer it is to 0, the worse and less accurate the detection results, which is consistent with the theory. One factor behind the low performance on Hugging Face is the use of a low-specification CPU, which hampers detection processing. Other factors are the degradation of file quality when uploading to the Hugging Face platform and the performance limits of a free, non-premium Hugging Face account.

Fig. 3. Model Testing Result
If there are cases of stacked food objects, the detection system cannot classify objects properly. However, if the food objects in the image are only

Table 2. Food List

Table 3. Food Calorie Data

Image Data Retrieval
Image data is taken manually by downloading images on the internet. The total number of image data taken is 560 images. There are 50 image data for each class consisting of apples, fried chicken, oranges, white rice, cakes, bananas, fried tofu, omelet, boiled eggs, and fried

Table 5. Model Testing Performance

Table 6. Confusion Matrix Hugging Face Testing

Table 7. Confusion Matrix Hugging Face Testing