Exploring the Potential of Pre-trained DCNN Models for Facial Emotion Detection: A Comparative Analysis

Emotion detection from facial expressions is important today for understanding how other people feel, particularly in interactions between AI machines and humans, which are now widespread. One example is an AI avatar. These machines sometimes cannot tell what their human partner feels, so their decisions can be inaccurate. AI avatars can be used to monitor a human partner's healthcare conditions, such as stress, depression, and anxiety, which can lead to suicide. This research aims to identify the best model for detecting emotion from facial expressions by comparing several pre-trained DCNN models: VGG16, VGG19, ResNet50, ResNet101, Xception, and InceptionV3. Accuracy, precision, recall, and F1-score are used to evaluate all models. The results show that the VGG19 model has the highest accuracy among the models, at 65%. The research concludes that a model's performance depends on various factors, such as the size and quality of the dataset used, the complexity of the problem to be solved, and the hyperparameters used during training.


Introduction
Emotion detection from people's facial expressions has become increasingly important, particularly for interactions between AI machines and humans, such as AI avatars [1]. These machines are created to assist humans, but they sometimes cannot tell what their human partner feels or thinks, so their decisions or responses can be inaccurate. This matters because an AI avatar that detects emotion well can strengthen its bond with the human user by personalizing its behavior. An AI avatar can also monitor a human partner's healthcare conditions such as stress, depression, or anxiety, which are serious problems that can lead to suicide. One example comes from a repeated cross-sectional survey assessing university students' stress, depression, anxiety, and suicidality in the early stages of the COVID-19 pandemic in Poland [2], which reports 7,228 suicidal deaths caused by depression, anxiety, and stress. Detecting facial expressions can also be applied to many cases beyond AI avatars. For example, emotion detection can be used to analyze the facial expressions of a suspected criminal at a crime scene [3], giving investigators a better understanding of whether the suspect is lying during interrogation. Detecting emotion is one of the most complex and difficult tasks, since facial expressions span many emotion classes such as happy, sad, angry, and more. Artificial intelligence, machine learning, and deep learning are needed to solve this problem because these methods automate the process of identifying and analyzing images or movements, such as human emotion. This research aims to identify the best model for detecting emotion from facial expressions by comparing several deep convolutional neural network (DCNN) pre-trained models.
Many studies focus on emotion detection and classification, using approaches such as speech recognition or image recognition. Most studies use the deep convolutional neural network (DCNN) algorithm, which provides the best accuracy in image detection; a DCNN is well suited to extracting patterns from an image. One related work that uses DCNN models is Facial Emotion Recognition using transfer learning in the Deep Convolutional Neural Network by Akhand, M. A. H., et al. [4]. That research focuses on facial recognition using pre-trained deep convolutional neural network models: VGG16, VGG19, ResNet18, ResNet34, ResNet50, ResNet152, InceptionV3, and DenseNet-161. The best facial emotion recognition accuracy in that research was achieved by the DenseNet-161 pre-trained model, ranging from 96.51% to 99.52%.
Another similar related work is Facial Emotion Recognition Using a Novel Fusion of Convolutional Neural Network and Local Binary Pattern in Crime Investigation by Zhu, D. et al. [3]. That research aims to analyze the psychological characteristics involved in a crime scene using CNN and CNN-CLBP models. The best result was achieved by the CNN-CLBP model, with an average accuracy of 88.16%.
There are also related works on emotion detection from video data using DCNNs, such as Facial Emotion Recognition from Videos Using Deep Convolutional Neural Networks by Abdulsalam, W. H. et al. [5], which reports a recognition rate of 95.12%, and further related works on emotion detection using DCNNs [6,7], which use deep learning algorithms to identify basic human expressions and achieve an average accuracy of 93%. Some related works use only a CNN for emotion detection [8][9][10]. There are also other approaches beyond regular emotion image detection. For example, Emotion Recognition using Convolutional Neural Network on Virtual Meeting Image by Julando Omar et al. [11], and Deep Learning Based Emotion Recognition and Visualization of Figural Representation by Xiaofeng Lu et al. [12], which studies emotion recognition with both speech and graphic visualization. Another example is Emotion Detection using Image Processing in Python by Raghav Puri et al. [13], which aims to detect and classify expressions obtained either from a camera system or from pre-existing images in memory using the CNN algorithm. A further related work is real-time face emotion classification and recognition using a deep learning model by Ashish Keshri et al. [14], which aims to effectively detect four emotions (neutral, happy, sad, and surprised) from frontal facial expressions, also using the CNN algorithm. There is also emotion detection from social media, which detects people's emotions from text [15], and emotion detection from people's pupil variation, i.e., the movement of their pupils [16].

Methodology
Methods used in this research include data preparation, data analysis, data pre-processing, data splitting, transfer learning, fine-tuning, evaluation, and results, as shown in Figure 1.

Datasets
The dataset used in this study is "Emotion Detection" by Ananthu from Kaggle, which consists of 7 emotion image classes: angry, disgusted, fearful, happy, neutral, sad, and surprised (Figure 2). As seen in Figure 2, the class sizes differ significantly: the 'disgusted' class has very little data compared to the other classes, while the 'happy' class is much larger than the rest. This could cause a serious imbalance in the data, so this research applies pre-processing that can mitigate the problem. Each class is augmented to obtain more data, doubling it or more, by transforming the images in different ways and directions: flipping, rotating, or changing the brightness scale. The images are resized to 224 x 224, and each pixel value is divided by 255 to rescale it to the range 0 to 1 (rescale 1./255).
The dataset consists of 35,887 images and was split into 80% for training and 20% for testing. The images are 48 x 48-pixel grayscale, and examples from each class are shown in Figure 3. There are 7,178 images in the test set and 28,709 images in the training set. For validation, 20% of the training set is held out, so the train:validation:test ratio is 60:20:20. This ratio was chosen because the total dataset is large, so the split still ensures enough training, validation, and test data.
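A minimal sketch of this pre-processing pipeline, assuming the Keras ImageDataGenerator API; the directory layout and the exact augmentation ranges are illustrative assumptions, not the settings used in this study.

```python
# Sketch of the augmentation and rescaling described above.
# Paths and augmentation ranges are hypothetical.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1. / 255,             # map pixel values from [0, 255] to [0, 1]
    horizontal_flip=True,         # flip images
    rotation_range=20,            # rotate images in different directions
    brightness_range=(0.8, 1.2),  # change the brightness scale
    validation_split=0.2,         # hold out part of the training images for validation
)
test_datagen = ImageDataGenerator(rescale=1. / 255)

train_gen = train_datagen.flow_from_directory(
    "data/train",                 # hypothetical directory layout
    target_size=(224, 224),       # resize to the 224 x 224 input shape
    color_mode="rgb",             # grayscale images replicated to 3 channels
    class_mode="categorical",     # 7 emotion classes
    subset="training",
)
val_gen = train_datagen.flow_from_directory(
    "data/train",
    target_size=(224, 224),
    color_mode="rgb",
    class_mode="categorical",
    subset="validation",
)
test_gen = test_datagen.flow_from_directory(
    "data/test",
    target_size=(224, 224),
    color_mode="rgb",
    class_mode="categorical",
    shuffle=False,                # keep order fixed for evaluation
)
```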

Deep Convolutional Neural Networks (DCNN)
Deep learning is needed in this research because the task is image classification, and a DCNN is structured as a stack of several layer types: convolutional layers, pooling layers, and fully connected layers. This research uses six DCNN architectures: VGG16, VGG19, ResNet50, ResNet101, InceptionV3, and Xception.
The VGG pre-trained model was first introduced in 2014 by researchers at Oxford University at ILSVRC-2014, the ImageNet Large Scale Visual Recognition Challenge. The VGG pre-trained models used in this research are VGG16 and VGG19. The difference between the two is depth: VGG16 has 16 layers, while VGG19 has a total of 19 layers [8].
ResNet stands for Residual Network; in this research, ResNet50 and ResNet101 are implemented. As their names indicate, ResNet50 has a total of 50 layers and ResNet101 has a total of 101 layers. ResNet's residual connections between layers help reduce loss, preserve the knowledge gained, and boost performance during training [8].
The InceptionV3 and Xception pre-trained models were both introduced by researchers at Google. Both are improvements on the Inception pre-trained model introduced in 2014: InceptionV3 was introduced in 2015 and has a total of 159 layers, while Xception was introduced in 2016 with a total of 36 convolutional layers. Both models are used for image classification tasks [8].
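All six backbones are available in tf.keras.applications; the sketch below shows how they can be loaded with ImageNet weights and without their classification heads, using the 224 x 224 input shape described earlier.

```python
# Sketch: the six ImageNet backbones compared in this study,
# loaded without their top (classification) layers.
from tensorflow.keras import applications

BACKBONES = {
    "VGG16": applications.VGG16,
    "VGG19": applications.VGG19,
    "ResNet50": applications.ResNet50,
    "ResNet101": applications.ResNet101,
    "InceptionV3": applications.InceptionV3,
    "Xception": applications.Xception,
}

# Each base model keeps its ImageNet convolutional features but
# drops the original fully connected head (include_top=False).
bases = {
    name: ctor(weights="imagenet", include_top=False,
               input_shape=(224, 224, 3))
    for name, ctor in BACKBONES.items()
}
```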

Transfer learning
This research uses pre-trained models to compare accuracy across all of them; using a pre-trained model also significantly reduces the time and resources needed while achieving higher accuracy. The pre-trained models used in this research are Xception, InceptionV3, VGG16, VGG19, ResNet101, and ResNet50. This research also applies fine-tuning, a method that improves a model's performance. As shown in Figure 4 for the default pre-trained network architecture (VGG16) in Line 1, fine-tuning first removes the fully connected head with 4096, 4096, and 1000 units in Line 2 and replaces it with fully connected layers of 512, 512, and 7 units; the convolutional layers are then frozen and only the FC layers are trained in Line 3, allowing the FC head to warm up. The frozen layers are then unfrozen and all layers are trained in Line 4 to perform the new classification task, so the pre-trained network can recognize classes it was not originally trained on [17].
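A minimal sketch of this two-phase fine-tuning procedure for VGG16 in Keras; the optimizer, learning rates, and epoch counts are illustrative assumptions, and train_gen and val_gen refer to the generators sketched in the Datasets section.

```python
# Sketch of the two-phase fine-tuning described above for VGG16:
# the 4096-4096-1000 head is dropped (include_top=False), replaced
# by a 512-512-7 head, the convolutional base is frozen for a
# warm-up phase, then unfrozen and the whole network is retrained.
from tensorflow.keras import applications, layers, models, optimizers

base = applications.VGG16(weights="imagenet", include_top=False,
                          input_shape=(224, 224, 3))

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dense(7, activation="softmax"),  # 7 emotion classes
])

# Phase 1: freeze the convolutional base and warm up the new FC head.
base.trainable = False
model.compile(optimizer=optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=5)

# Phase 2: unfreeze all layers and train the whole network, using a
# lower learning rate so the pre-trained weights are not destroyed.
base.trainable = True
model.compile(optimizer=optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=20)
```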

Metrics
To evaluate the machine learning models, this research uses four metrics: accuracy, precision, recall, and F1-score. The accuracy rate is the proportion of correct classifications made by a classifier across the entire test set. Equation 1 is used to calculate accuracy [18].
Precision is the ratio of true positives to all predicted positives, which include both the true positives and the false positives (FP), i.e., negatives mistakenly reported as positive. Equation 2 is used to calculate precision [18].
The capacity of a model to properly identify the genuine positives is known as recall, also called the true positive rate. Equation 3 is used to calculate recall [18].
The F1-score is the harmonic mean of precision and recall. It is a measure of a test's accuracy, derived from the test's precision and recall. Equation 4 is used to calculate F1 [18].
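Equations 1 to 4 are not reproduced in this text; in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), their standard definitions are:

```latex
\text{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}

\text{Precision} = \frac{TP}{TP + FP} \tag{2}

\text{Recall}    = \frac{TP}{TP + FN} \tag{3}

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}
               {\text{Precision} + \text{Recall}} \tag{4}
```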

Result and discussion
This study attained accuracy surpassing 60% after fine-tuning with all six pre-trained CNN models: Xception, InceptionV3, VGG16, VGG19, ResNet50, and ResNet101. Using Google Colab with a standard GPU, the best accuracy achieved was 65.0%, obtained with the fine-tuned VGG19 model. The second- and third-best accuracies were 64.89% and 64.73%, obtained with ResNet50 and ResNet101, which were also fine-tuned for better results. As seen in Table 1 and Figures 5 and 6, fine-tuning improved all model metrics. InceptionV3's accuracy improved by about 10%, and ResNet50 and ResNet101 improved by about 40%. VGG16's accuracy improved by 17.26%, VGG19's by 20.64%, and Xception's by 7.16%. These results show that a deep convolutional neural network is one of the most effective ways to learn human emotions, which is among the hardest tasks because many expressions must be learned from facial movement and other cues. Figures 7 to 9 show that VGG19 has the best performance after fine-tuning, with the highest accuracy; the training graph shows slight overfitting, but the result is still good, while the other models remain below VGG19's accuracy, as seen in Table 1. The VGG19 pre-trained model achieved an average accuracy of 44.36%, which fine-tuning improved to 65%. Since this research is a comparative analysis, the other models can be compared against VGG19: VGG16 achieved an average accuracy of 46.56%, improved by fine-tuning to 63.82%; InceptionV3 achieved 52.73%, improved to 62.82%; Xception achieved 52.24%, improved to 61.4%; and ResNet50 and ResNet101 both achieved 24.71%, later improved to 64.89% and 64.73%, respectively. The ResNet models improved the most from fine-tuning, which may be caused by overfitting introduced by the data pre-processing.
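A sketch of how the four reported metrics can be computed on the test set with scikit-learn's classification_report; model and test_gen refer to the earlier sketches.

```python
# Sketch: per-class precision, recall, F1, and overall accuracy
# on the held-out test set.
import numpy as np
from sklearn.metrics import classification_report

y_prob = model.predict(test_gen)        # class probabilities per image
y_pred = np.argmax(y_prob, axis=1)      # predicted class indices
y_true = test_gen.classes               # true labels (valid since shuffle=False)

print(classification_report(y_true, y_pred,
                            target_names=list(test_gen.class_indices)))
```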
The research concludes that a model's performance depends on various factors, such as the size and quality of the dataset used, the complexity of the problem to be solved, and the hyperparameters used during training. This research also shows that the fine-tuning method raises accuracy. The dataset used here is an older one, with black-and-white (grayscale) images, which makes the pre-trained models harder to train, since most of them were built for RGB images. That does not mean grayscale images are not worth using; training on grayscale images has benefits too, such as cheaper computational requirements, simplicity, and advantages for tasks that depend on image shape and texture rather than color. Using a grayscale dataset therefore also shows how the pre-trained models react to such images and how their output can be improved even though the images are grayscale.

Conclusion
Pre-trained deep convolutional neural network (DCNN) models have proven to be one of the best approaches to learning emotion detection from people's facial expressions. Pre-trained DCNNs can achieve good accuracy, depending on the data used for training, validation, and testing. Training a deep convolutional neural network is expensive and time-consuming, since training on the dataset takes a very long time, but this research shows that using these pre-trained models and optimizing them with fine-tuning yields large improvements in accuracy compared to using them without it.
As an example proven in this research, the VGG19 pre-trained model achieved an accuracy of 44.36%, and after optimization with the fine-tuning method, the accuracy improved to 65%. With this research, researchers can identify the performance differences among all six pre-trained models and learn which one performs best, which can later be used to detect emotion from facial expressions for AI avatars or other purposes. Future work will aim to improve emotion detection by testing more datasets from around the world, finding more methods to improve training and testing, and exploring additional approaches such as ensemble voting or ensemble models, which combine two or more models for a better result, so that emotion detection can serve artificial intelligence applications and many other uses.

Table 1. Accuracy comparison of the pre-trained models before and after fine-tuning.

Model        Before fine-tuning   After fine-tuning
VGG16        46.56%               63.82%
VGG19        44.36%               65.00%
InceptionV3  52.73%               62.82%
Xception     52.24%               61.40%
ResNet50     24.71%               64.89%
ResNet101    24.71%               64.73%