Automated skin burn detection and severity classification using YOLO Convolutional Neural Network Pretrained Model

Skin burn classification and detection is a topic worth discussing within the theme of machine vision, as a skin burn can be either a minor medical problem or a life-threatening emergency. Being able to determine and classify skin burn severity can help paramedics give more appropriate treatment to patients with different severity levels of skin burn. This study approaches the topic using computer vision, applying YOLO convolutional neural network models that classify the skin burn degree and locate the burnt area using the models' bounding-box feature. This paper is based on experiments with these models on a dataset gathered from Kaggle and Roboflow, in which the burnt areas in the images were labelled by degree of burn (i.e., first-degree, second-degree, or third-degree). The experiments compare the performance of different base and fine-tuned models that apply the YOLO algorithm to this custom dataset, with the YOLOv5l model performing best, reaching 73.2%, 79.7%, and 79% before hyperparameter tuning and 75.9%, 83.1%, and 82.9% after hyperparameter tuning for the F1-score and mAP at 0.5 and 0.5:0.95, respectively. Overall, this study shows how fine-tuning can improve some models, how effective these models are at this task, and whether, using this approach, the selected models could be implemented in real-life situations.


Introduction
Burn wounds have always been among the world's most common injuries, ranging from first degree, where the burn affects only the epidermal (outermost) layer of the skin, to third degree, where the burn extends at minimum to the fat layer beneath the dermis. A burn can have several dangerous effects on the wounded individual, to the point that the burned part of the body sometimes requires grafting to fully close the wound and prevent further infection or other negative after-effects.
The most prevalent problem regarding burn wound degree is distinguishing second-degree from third-degree burns, as in some cases the visible difference is minimal, and people, including the wounded individual, may take lightly a wound they thought was a second-degree burn when in reality it was a third-degree burn. Such cases often happen around the world, with people having to deal with the after-effects only after considerable time has passed. For example, a light third-degree burn that could have been treated with antibiotics and other medicines to improve the natural recovery of the burnt skin may be left alone for too long, the negative after-effects continuing to damage the healthy skin (and flesh and muscle) around the wound, eventually requiring a skin (and sometimes flesh) graft to remove the decaying tissue, as normal antibiotics and medicines would no longer have any effect (and might even harm the wound and the healthy tissue around it). According to the World Health Organization (WHO), an estimated 180,000 deaths are caused by skin burns each year, and in some parts of the world 17% of children with burns have a temporary disability and 18% a permanent disability [1]. The majority of cases happen in low- to middle-income countries.
This topic of skin burn wound degree detection and classification has attracted considerable interest from researchers and other experts in the field. Their studies use different approaches and methods, producing various results, most of which can be considered positive outcomes, as most ended with an improvement in accuracy, efficiency, or usability.
Despite numerous negative factors that could impede progress on this topic, researchers have managed to overcome them in various ways. One research paper [2] addressed dataset limitations to obtain efficient training samples using several available methods, and concluded that SSL-based models give the best performance compared to the other methods.
Feature extraction for this topic has been discussed in two papers: one used texture-based feature extraction [3], and the other [4] developed a feature extraction model to classify various factors of a burn wound, such as burn area, depth, and location. From these two papers it can be concluded that features such as texture, graft versus non-graft appearance, and colour are worth trying for detecting and classifying skin burn degree.
Most of the research done on this topic uses CNN models. One such study [5] used a DCNN approach with pre-trained models such as ResNet50, MobileNet, and VGG16. Another [6] used semantic segmentation to improve the accuracy of a CNN model, with successful results. Another [7], mentioned previously, tried a different segmentation method, Simple Linear Iterative Clustering (SLIC) superpixel segmentation with an SVM model, and achieved results considered effective enough for hands-on practice. Yet another study [8] compared eight different hybrid segmentation algorithms used with a CNN model.
One researcher [9] compared two approaches to modifying a pre-trained ResNet model. In the first, the top-most layer and the classification layer were removed, and new layers were trained using image features from the remaining convolution layers. In the second, the same layers were removed, but instead of adding new layers, an SVM was used as a substitute classifier. The study concluded that using the SVM classifier gives better output accuracy.
Another research paper [10] explained that human-based frameworks would be a great addition to AI models for this topic. Not only do they allow humans to add features according to their needs, but the insight of human experts can also be integrated into the machine's program, creating a more effective model.
Another idea was proposed by a research paper [11] that used deep learning algorithms to classify burnt areas in images. The paper is built on a creative idea, utilizing Fuzzy C-means clustering to segment regions of the images, simplifying the machine's work and making classification more efficient; it reached quite a high accuracy when compared to some current state-of-the-art approaches.
In this research, the burn wound image dataset from Kaggle [12] was used, as it was already labelled in YOLO format. Combining it with another dataset from the Roboflow website [13] results in a very discernible dataset for supervised learning of the model, with the label class being the expected output of the machine.
The purpose of this research is to experiment with skin burn detection and classification using YOLO algorithm models and to study how tuning can improve their performance. This research has the potential to improve detection and classification for this topic using the newest cutting-edge technology.

Dataset
The dataset used in this research was obtained by combining a dataset [12] from Kaggle, consisting of approximately 1300 images of skin burn wounds of different degrees (split as listed in Table 1), with a dataset [13] from Roboflow, a renowned computer-vision platform (split as listed in Table 2). The combination results in a total of 3691 burn wound images, split approximately 5:5:2 among first-, second-, and third-degree burns. The dataset is split into 80% for training and 20% for validation and testing (10% each), as listed in Table 3. All images are resized to a default size of 640 x 640 pixels. Examples of the data inside the dataset can be seen in Fig. 1.
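The 80/10/10 split described above can be sketched as follows; this is a minimal illustration assuming the image paths have been collected into a list (the file names and the exact splitting procedure are illustrative, not the actual dataset layout):

```python
import random

def split_dataset(image_paths, train=0.8, valid=0.1, seed=42):
    # Shuffle deterministically so the split is reproducible
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (paths[:n_train],                      # training set
            paths[n_train:n_train + n_valid],     # validation set
            paths[n_train + n_valid:])            # testing set

# 3691 images split 80/10/10, as in this research
images = [f"img_{i:04d}.jpg" for i in range(3691)]
train_set, valid_set, test_set = split_dataset(images)
print(len(train_set), len(valid_set), len(test_set))  # 2952 369 370
```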

Data augmentation
As the data available for each role within the dataset was clearly not enough, data augmentation techniques [14] such as horizontal flips, increased output examples, and horizontal and vertical bounding-box flips were applied. This roughly tripled the dataset, reaching 10032 images for the training dataset, 457 for the validation dataset, and 462 for the testing dataset.
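In YOLO's normalized label format, a horizontal flip only requires mirroring the box's x-centre. A minimal sketch of the idea, assuming labels stored as (class, x_center, y_center, width, height) tuples and an image stored as rows of pixels (both illustrative representations, not the actual augmentation pipeline used here):

```python
def hflip_label(label):
    """Horizontally flip one YOLO label (class, x_center, y_center, w, h).
    Coordinates are normalized to [0, 1], so mirroring is 1 - x_center."""
    cls, x, y, w, h = label
    return (cls, 1.0 - x, y, w, h)

def hflip_image(image):
    """Flip an image stored as rows of pixels by reversing each row."""
    return [row[::-1] for row in image]

label = (2, 0.3, 0.55, 0.2, 0.1)   # an example box, class id 2
print(hflip_label(label))           # (2, 0.7, 0.55, 0.2, 0.1)
```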

Experimental setup
The YOLO algorithm [15], an abbreviation of "You Only Look Once", is mainly used to detect various objects within an image (in real time) by applying a CNN to detect the existence and locations of those objects. A Convolutional Neural Network (CNN) is a deep learning algorithm that works in multiple layers, receiving an image as input and outputting its classification. The dataset for a YOLO model must be labelled in YOLO format, with the necessary annotations available for the model to read and process: the object's class, the object's centre coordinates, and the object's bounding-box width and height, written in that exact order.
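A YOLO annotation file holds one line per object in the order just described: class id, then the normalized box centre, width, and height. A minimal parser, assuming that standard five-field layout:

```python
def parse_yolo_line(line):
    """Parse 'class x_center y_center width height' (normalized floats)."""
    fields = line.split()
    cls = int(fields[0])
    x, y, w, h = map(float, fields[1:5])
    return {"class": cls, "x": x, "y": y, "w": w, "h": h}

# e.g., class 1 could be a second-degree burn, depending on the class list
box = parse_yolo_line("1 0.48 0.52 0.35 0.40")
print(box["class"], box["w"])  # 1 0.35
```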
The algorithm first splits the image given to the model into grids of equal dimensions. Each grid is then checked individually (and simultaneously) with the CNN to determine whether it contains an object. When a grid finds an object, it creates a bounding box using a regression method to predict the object's height, width, centre, and class. Once every object found within the image has its own bounding box, the model uses the Intersection over Union (IoU) metric to ensure that the predicted bounding box contains no unnecessary extra height or width and that the final box fits the exact size of the object.
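The IoU metric used in that last step is the ratio of the overlap area of two boxes to their combined area. A minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2) corners:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 region: IoU = 1 / 7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```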
The final result of the detection model should be the original image filled with various different bounding boxes, each representing an object that is detected within the image.
In this research, several YOLOv5 [16] models that differ in size and a YOLOv7 [17] model were chosen as the pre-trained models for the experiment. Each model is fine-tuned according to the best result found by hyperparameter evolution and then compared with the non-tuned pre-trained model. The workflow methodology listed in Fig 2 is built on the utilization of multiple datasets: first the public datasets are gathered, then merged and stored as one singular dataset, then cleaned and augmented so they can be fed to the YOLO algorithm model, which is tuned using hyperparameter tuning and evaluated on the previously prepared validation dataset.
YOLO algorithm models were chosen to detect and classify burn wounds in images because their performance dwarfs that of other well-known models by a significant amount. The ability to handle multiple objects, separating them as individual objects with bounding boxes while simultaneously classifying the detected objects, is a rare advantage that only some models have, and YOLO is among the best of them. The efficiency of its performance, largely thanks to the effectiveness of the DCNN architecture used in the models, is another main reason why YOLO is among the best-performing models for object detection and classification. YOLO is also flexible with respect to the variety of objects in a dataset, which matters here because skin burn detection and severity classification involves varied data, from simple images of a single burn wound filling the frame from one side of the border to the other, to images of a burn victim with multiple wounds of different severity. The ability to process a large amount of data is a further reason this model was chosen. Other models often mentioned in computer vision have the same or similar advantages, but the ability to declare a large dataset already split into its roles (train, test, valid), together with other functional necessities such as the number of classes, in a single .yaml command file is one reason YOLO is considered among the best not only in performance, speed, and efficiency, but also in simplicity.
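The single .yaml command file mentioned above typically takes roughly the following shape; the directory paths below are illustrative placeholders, not the actual layout used in this research:

```yaml
# data.yaml — dataset definition read by the YOLO training script
train: datasets/burns/images/train   # illustrative paths
val:   datasets/burns/images/valid
test:  datasets/burns/images/test

nc: 3                                # number of classes
names: ["first-degree", "second-degree", "third-degree"]
```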
There are also many ways to experiment with the model, for example by applying various kinds of tuning to the model itself. This is quite plausible because the model is built on a deep convolutional neural network backbone as its main prediction system, a very flexible network that can easily be modified for experiments by controlling the total number of layers inside the network.

YOLOv5
YOLOv5 [16] is, as the name states, a later version of the YOLO algorithm model, improving on both YOLOv3 and YOLOv4. Compared to its predecessors, this version achieves a major improvement in the balance between processing speed and resulting accuracy. The models chosen for this experiment are the top three heavyweight versions among those available: YOLOv5m, YOLOv5l, and YOLOv5x.
The model starts by extracting features from the dataset images using a network backbone. It then applies multiple convolutional neural network layers on top of the backbone, creating a feature pyramid. The result is feature maps of the image at multiple scales, which are processed further to find and capture the bounding box of the object. These bounding boxes are refined using anchor boxes, predefined bounding boxes with their own sizes and ratios, as references for predicting the possible bounding box of the object. Once the predictions are done, the model generates prediction results through the prediction head, a list of the results the model has processed. From this list, the model applies Non-Maximum Suppression (NMS) [18] to remove all bounding boxes considered redundant or overlapping, leaving only the prediction the model is most confident of, which is then used to generate the final result of the detection and classification process: the original image with the bounding box of the detected object attached. These processes are roughly illustrated in Fig 3 and Fig 4.
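The NMS step described above can be sketched as a greedy loop: keep the most confident box, drop any remaining box that overlaps it beyond an IoU threshold, and repeat. A minimal self-contained illustration (boxes as (x1, y1, x2, y2, score) tuples; the 0.5 threshold is a common default, not a value taken from this paper):

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2, ...) boxes; extra fields are ignored."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, iou_thresh=0.5):
    """Greedy NMS over (x1, y1, x2, y2, score) boxes, highest score first."""
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Suppress every box that overlaps the kept box too much
        remaining = [b for b in remaining if _iou(best, b) < iou_thresh]
    return kept

detections = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]
print(len(nms(detections)))  # 2 — the two overlapping boxes collapse to one
```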

YOLOv7
YOLOv7 [17] is a YOLO algorithm model released quite recently and can be regarded as the current best YOLO model, as its detection speed and accuracy are said to surpass those of any previous version. It has improved enough to be used properly not only for bounding-box object detection but also for semantic segmentation.
This model works roughly the same way as YOLOv5, with only a few differences due to the newer version. It starts as YOLOv5 does, extracting features using a network backbone, then processes the extracted features into different scales by applying multiple convolutional neural network layers, creating a feature pyramid, and employs anchor boxes as references to predict possible object bounding boxes. The difference lies in the detection step: YOLOv5 uses a single-scale detection approach, processing one layer at a time, whereas YOLOv7 employs a multi-scale detection approach, processing multiple layers of the feature pyramid at once. This can be considered both a positive and a negative change compared to its predecessor: it allows a faster processing rate than YOLOv5's single-scale approach, but it also consumes more computational resources.
Once that process is done, the model assigns each object to a specific grid cell based on the centre coordinate gained from detection; this grid acts as the main reference when the model assigns a class to the detected object. The following steps are the same as in YOLOv5: creating a prediction head from the detection and classification results, using the NMS technique to find the best result, and generating the final result as the original image with the bounding box attached. Another difference between YOLOv5 and YOLOv7 is the network backbone: YOLOv5 uses CSPDarknet, while YOLOv7 uses either Darknet or ResNet [20].

Hyperparameter tuning
To experiment further with the results of the plain pre-trained YOLO models, hyperparameter tuning [21], [22] was applied to each of them, using the best hyperparameters gained from each model's hyperparameter evolution. This was done not only to broaden the research's scope and possible results but also to see whether tuning the hyperparameters improves the models' performance, which should be visible through the metrics used in this research.
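Hyperparameter evolution works by repeatedly mutating the current best hyperparameters and keeping the mutation that scores the highest fitness. The genetic loop can be sketched generically as below; the hyperparameter names, bounds, and the toy fitness function are illustrative stand-ins, not the actual YOLO search space or fitness formula:

```python
import random

def evolve(fitness, base, bounds, generations=50, sigma=0.2, seed=0):
    """Mutate-and-select search: keep the best-scoring hyperparameter set."""
    rng = random.Random(seed)
    best, best_fit = dict(base), fitness(base)
    for _ in range(generations):
        cand = {}
        for key, value in best.items():
            lo, hi = bounds[key]
            # Multiplicative gaussian mutation, clipped to the allowed range
            cand[key] = min(hi, max(lo, value * (1 + rng.gauss(0, sigma))))
        f = fitness(cand)
        if f > best_fit:                 # greedy selection
            best, best_fit = cand, f
    return best, best_fit

# Toy fitness peaking at lr0 = 0.01, momentum = 0.94 (illustrative only)
fit = lambda h: -((h["lr0"] - 0.01) ** 2 + (h["momentum"] - 0.94) ** 2)
base = {"lr0": 0.05, "momentum": 0.8}
bounds = {"lr0": (1e-5, 0.1), "momentum": (0.6, 0.98)}
best, best_fit = evolve(fit, base, bounds)
```

In the real setting, the fitness function would be the model's validation score (e.g., a weighted combination of mAP values) obtained by training with the candidate hyperparameters.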

Metrics
The metrics used within this research vary, so the confusion matrix of the model's results is required to calculate them. The data within the confusion matrix is split into four types: True Positives (TP, where the label was positive and the model's result was also positive), True Negatives (TN, where the label was negative and the model's result was also negative), False Positives (FP, where the label was negative but the model's result was positive), and False Negatives (FN, where the label was positive but the model's result was negative). These values are used as the variables in each performance metric's equation to obtain the metric's percentage.
This research uses the key metrics of YOLO object detection models: Precision (1) and Recall (2), two of the usual default metrics for evaluating a machine learning model; the F1-score (3), the harmonic mean of Precision and Recall; and mAP (4), short for mean Average Precision, the average precision computed at IoU thresholds of 0.5 and 0.5:0.95. These are the usual (default) performance metrics for classification and object detection models. Each metric has a different formula [23], but the main idea remains the same: the higher the percentage (the closer to 100%), the better. The equations of each performance metric are as follows:

Precision = TP / (TP + FP) (1)
Recall = TP / (TP + FN) (2)
F1 = 2 x Precision x Recall / (Precision + Recall) (3)
mAP = (1/N) x Σ APi, i = 1..N (4)

where APi is the average precision of class i and N is the number of classes.
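Given the confusion-matrix counts defined above, Precision, Recall, and the F1-score can be computed directly; a minimal sketch (the counts below are made-up numbers for illustration only):

```python
def precision(tp, fp):
    """Eq. (1): fraction of positive predictions that were correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Eq. (2): fraction of actual positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    """Eq. (3): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

p = precision(tp=80, fp=20)       # 0.8
r = recall(tp=80, fn=10)          # 0.888...
print(round(f1_score(p, r), 3))   # 0.842
```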

Result and discussion
In this study, the authors developed automated skin burn detection and severity classification using YOLO convolutional neural network pre-trained models. The results demonstrate the potential of this cutting-edge technology to improve the accuracy and efficiency of detecting and classifying skin burns in real-world cases.
Within this research, the pre-trained YOLO algorithm models were used together with the hyperparameter evolve feature, which generates random hyperparameters that may turn out to be the best for the model. As seen in Table 4, the performance of each model almost always improved after being run with the hyperparameters found from its hyperparameter evolution. The increase is clearest in the mAP metrics, averaging 2-3% for each of the YOLOv5 models and reaching 13% and 8.5% for mAP@0.5 and mAP@0.5:0.95 on the YOLOv7 model.
Although other metrics, such as Recall and Precision, decreased after hyperparameter tuning, this decline is expected, because an increase in Recall tends to cause Precision to decrease, and vice versa. The decrease in the F1-score of the YOLOv5m model can likewise be attributed to the increased Precision of the tuned model, and since both of its mAP values increased, the change can still be considered a positive gain in performance.
Comparing the YOLOv5 models of different sizes shows similar learning-rate curves; in particular the two best-performing models, YOLOv5l and YOLOv5x, start at quite a high value (around 0.025-0.05). The YOLOv7 model, by contrast, shows a particularly interesting curve for the hyperparameter-tuned pre-trained model. This shows that although YOLOv5 is the older YOLO algorithm model compared to YOLOv7, it is not necessarily worse. The YOLOv7 model without hyperparameter tuning starts at the top of the average point, around 0.05, and its train line rises noticeably near the end of training, signifying that it has not necessarily converged yet. The learning-rate graph of the tuned YOLOv7 also confirms that the YOLOv7 model evaluated worse, not only in its metric percentages but also in its learning-rate graph, which flows unevenly, forming a wave instead of the steadily declining curves of the YOLOv5 models.
The learning-rate graphs of the YOLOv5 models also show an average starting training loss of 0.025 for the models without hyperparameter tuning. Although there is a chance this means a model becomes overfit to the training data, a model that takes longer to converge is more often considered likely to perform better at its assigned task. So it can be said that the hyperparameter tuning done for the YOLOv5 models did its job and improved the potential of the models, and as the metric percentages show, it improved their performance as well.
Compared to previous results shown in [5], the pre-trained YOLOv5 model performs worse than VGG-16, MobileNet, and ResNet50 with transfer learning, though the difference in F1-score from the YOLOv5 model without transfer learning is not significant. This may be caused by the difference in dataset size between that paper and this one. Another study [10], which experimented with human intervention to improve model performance, managed to improve the model's F1-score by around 5%, a comparably better result than ours. The utilization of a human in the loop has shown quite an improvement in performance and may be the main reason for the difference from the results of the hyperparameter-tuned YOLO models used in this research.
This paper proposed the use of YOLO algorithm models to detect and classify skin burns in images gathered from datasets containing various degrees of burn injuries. The experiments trained the YOLO models on these datasets and used them to predict the correct position and class of the burn wound in each image.
The results show that using YOLO algorithm models can be considered a positive approach. The proposed models show promise for future research: just by utilizing some simple techniques (e.g., hyperparameter tuning), they delivered performance that exceeded expectations, and data augmentation on the dataset has also proven effective.
However, despite the positive results of this study, several notes must be taken for future research. One is the amount of data in the dataset used for the experiment; applying simple data augmentation techniques is simply not enough to enlarge its scale. Other variations of data need to be included in the dataset for further improvement of the models, and pictures from other datasets can also be used to increase variety and push the performance of the YOLO models even higher. However, the amount of data considered 'usable' in terms of quality and variance may be hard to gather, as much of the available data (images, in this case) has not been labelled and cleaned for experimental use. It is also necessary to analyze the data before use in order to filter out data that might cause negative effects (e.g., decreased performance, or noise that might confuse the model).
Other data augmentation techniques can also be applied to this topic, such as flipping, random erasing, and translation [14]. Implementing them is expected to improve the models' performance by creating even more data variants from which the model can learn to recognize more of the patterns that exist within the dataset.
However, not all data augmentation techniques are equally effective or suitable for this topic. Techniques that affect the colour of the data, such as colour space transformations (saturation control, colour distribution alteration, grayscaling, etc.) [14], might instead reduce performance, either by creating a false pattern that the model recognizes and uses as the baseline of its classification, or by changing or even completely erasing the main features the model relies on to classify the data.
Further experimentation with hyperparameter evolution [21], [22] is another topic that can be utilized in future research. Running hyperparameter evolution for a longer processing time and more epochs might find better hyperparameters for each model than those used for the tuned models in this research. There are, however, complications to address, such as the need for more computational resources and time, and the possibility of not finding a better hyperparameter set than the one obtained from the evolution in this paper. For this idea to be implemented, experiments on a far larger scale than those in this paper would be necessary to ensure the results are worth the problems and implications that might arise from the experimentation itself.
Other methods can still be tried to make the proposed YOLO models perform even better, and other variations of the techniques used in this paper can also be applied and analyzed, all to improve the performance of the proposed models.

Conclusion
This research proposed the use of some of the newest YOLO algorithm models to detect and classify the degree of a burn wound within an image, and based on the results obtained, it can be considered a promising and plausible idea in the topic of burn wound detection and classification. This conclusion is backed by the fact that simple tuning, using hyperparameters gained from hyperparameter evolution, resulted in quite a significant rise in the performance of each model, especially in mAP@0.5 and mAP@0.5:0.95. This shows that the topic can still be improved in various ways, such as enlarging the dataset, adding more data augmentation, and running more hyperparameter evolution and analyzing its results to pick hyperparameters with the potential to further improve the performance of the YOLO models.

Fig. 1. Data examples from each class in the dataset.

Fig 3 clearly describes the architecture of the YOLO algorithm models, in which the network backbone, whose structure can be seen in Fig 4, differs for each version of YOLO. This is explained further in the following subsections for each model.

Fig 5 and Fig 7 show the learning curves of the models without hyperparameter tuning, while the tuned models have an average starting training loss of 0.04, as seen in Fig 6 and Fig 8. This, however, does not mean that the tuned models are worse than the untuned ones: the tuned models took more epochs to converge, as shown in Fig 6 and Fig 8, whose graphs have not converged even at the last epoch (epoch 25), while the untuned models, as seen in Fig 5 and Fig 7, converged near the end (epochs 24-25).

Table 1. Total images for each class and their total in each partition of the dataset from Kaggle.

Table 2. Total images for each class and their total in each partition of the dataset from Roboflow.

Table 3. Total images for each class and their total in each partition.

Table 4. Results of the YOLO algorithm models without and with hyperparameter tuning.