Age estimation through facial images using Deep CNN Pretrained Model and Particle Swarm Optimization

Abstract. There has been a lot of recent research on age estimation utilizing different optimization techniques, architecture models, and diverse strategies with some variations. However, accuracy improvement in age estimation studies remains a challenge due to the inability of traditional approaches to effectively capture complex facial features and variations. Therefore, this study investigates the use of Particle Swarm Optimization in deep CNN models to improve accuracy. The focus of the study is on exploring different feature extractors for the age estimation task, utilizing pre-trained CNN models such as VGG16, VGG19, ResNet50, and Xception. The proposed approach utilizes PSO to optimize the hyperparameters of a custom output layer for age detection in regression. The PSO algorithm searches for the optimal combination of model hyperparameters that minimizes the age estimation error. This study shows that fine-tuning a model can lead to improvements in its performance, with the VGG19 model achieving the best performance after fine-tuning. Additionally, the PSO process was able to find sets of hyperparameters that were on par with or even better than the initial hyperparameters. The best result was achieved with the VGG19 architecture, with a loss of 86.181, an MAE of 6.693, and a MAPE of 38.462. Out of the twelve experiments conducted, it was observed that the utilization of Particle Swarm Optimization (PSO) offered distinct advantages in terms of achieving better results for age estimation. However, it is important to note that the execution time for these experiments was considerably longer when employing PSO.


Introduction
The practical uses of facial age estimation in a variety of industries, including social science, security systems, healthcare, and entertainment, have made it a topic of interest in the field of machine learning. Because faces are trustworthy and useful sources of information reflecting age-related traits, facial features play a crucial part in age prediction [1]. However, age estimation is a difficult task because of the irregular pattern of facial aging. Each person's aging stages vary due to the nature of human faces, which is influenced by both internal and external causes [2]. Furthermore, aging patterns form a nonstationary, unpredictable process. Therefore, utilizing age estimation technology to recognize facial images is crucial. Because a person's face carries so much information about their age, and far more image data than other regions of the body, facial age assessment is comparatively accurate. Age estimation is beneficial for many purposes. In medicine, it can help determine the appropriate procedure and drug dose, since older people usually require larger doses than younger ones. It can also support criminal investigations by identifying the age of a victim or suspect, making it easier to find the right person.

Corresponding author: nicholas.muliawan@binus.ac.id
CNNs have emerged as a powerful deep learning approach, particularly in the field of face recognition. Numerous studies have shown that CNNs outperform traditional machine learning algorithms and have become a popular choice for researchers and practitioners alike. With the ability to automatically learn and extract features from raw image data, CNNs have demonstrated superior performance in face recognition tasks. In this context, it is worth noting that CNNs have been widely adopted and validated by various studies in the field of facial image recognition.
This study aimed to utilize the feature extraction capabilities of VGG16, VGG19, ResNet50, and Xception to extract highly relevant features from our dataset. The models were modified with a regression output, fine-tuned, and their performance optimized using Particle Swarm Optimization (PSO). Our approach leverages the strengths of these powerful models to enhance the accuracy and efficiency of our results.
Many studies have already been conducted on facial identification to estimate age. This study, however, used several well-known Convolutional Neural Network (CNN) models through transfer learning, which were modified, fine-tuned, and further optimized with Particle Swarm Optimization (PSO). Previous studies suggest that transfer learning is usually preferred, as it is easier than building a custom model [3].
The optimization approach known as Particle Swarm Optimization (PSO) was inspired by nature and has been applied to numerous computer vision challenges. The algorithm imitates the cooperative behavior of fish schools or flocks of birds, whose members communicate to converge on the best solution. PSO has been applied successfully to feature selection, image segmentation, and object detection.
This paper proposes the use of PSO in age detection to improve the accuracy of age estimation in CNN models. The proposed approach utilizes PSO to optimize the hyperparameters of a custom output layer for age detection in regression. The PSO algorithm searches for the optimal combination of model hyperparameters that minimizes the age estimation error. By using PSO to optimize the model, the proposed approach can handle the significant variation in facial appearance within age groups and increase the accuracy of age detection.
Previous research on Particle Swarm Optimization (PSO) has demonstrated that a CNN-PSO approach can achieve competitive performance. One study proposed a technique for hyperparameter tuning of convolutional neural networks (CNNs) using a particle swarm optimization algorithm. The algorithm was tested on the MNIST dataset, where it outperformed certain cutting-edge techniques. With 15 particles and 30 iterations, the findings show that the method can successfully discover the CNN model's optimal parameters, outperforming other designs with a testing error of 0.87 [4]. Another study improved its model using K-Fold Validation and pre-trained CNN models such as VGG16, ResNet50, and SeNet50. When compared with other models, their model sometimes performed just as well. The proposed ResNet50's accuracy was 71.84%, VGG16's was 65.31%, and SeNet50's was 61.96% [5].
In a different investigation of 3D knuckle identification, InceptionV3 performed better than the other models in terms of accuracy, recall, precision, and F1 score. According to the study, the InceptionV3 model outperformed the other pretrained models in classifying 3D images of the middle finger knuckles, with an accuracy, recall, precision, and F1 score of 99.07%, 99.18%, 99.07%, and 99.07%, respectively [6].
The use of the Discrete Cosine Transform (DCT) and Artificial Neural Networks (ANN) for age prediction has also been studied. DCT is a feature extraction technique that transforms an image with low correlation and many features into one with higher meaning and fewer features. That study used facial image datasets with ages ranging from 1 to 116, and its findings produced a model with an MAE of 9.40 on the testing data that can identify and predict ages in that range [7].
A study on disease detection and classification used a dedicated hardware model for detecting Chronic Kidney Disease (CKD) from saliva samples, along with a Particle Swarm Optimized One-Dimensional Convolutional Neural Network with Support Vector Machine (1-D CNN-SVM). The optimized architecture performs better than competing techniques and achieves 98.25% accuracy [8].
According to one study employing PSO for multivariate time-series analysis to optimize CNN models, the PSO-CNN technique outperformed conventional CNN architectures. The model with 178 neurons in the fully connected layer and two hidden layers produced the lowest Root Mean Squared Error (RMSE), which is 1.386. According to the study's findings, by employing PSO for hyperparameter tuning, efficient CNN designs can be created automatically, cutting down on lengthy experimental and computing times [9].
Finally, a different study demonstrated an approach for deploying EFI-CNNs based on fuzzy integral theory. The EFI receives the outputs of trained CNNs as input, and particle swarm optimization is utilized to find the ideal fuzzy density value. According to the experimental findings, the proposed technique was more accurate than previous methods in classifying age and gender, by 5.95% and 3.1%, respectively [10].
By incorporating PSO to optimize the hyperparameters of CNN models for age detection, the accuracy of age estimation can be significantly improved compared to traditional methods. This improvement is expected to be achieved by utilizing PSO to fine-tune the custom output layer for age detection in regression. The use of PSO in optimizing the model's hyperparameters is expected to handle the substantial variation in facial appearance within age groups and enhance the accuracy of age detection. The proposed approach aims to increase the performance of the models used in this study.

Methodology
This research used CNN methods to estimate facial age from images, because CNNs have been commonly used by many researchers and have so far produced strong results. This study then compared our CNN + PSO approach with a standard CNN. The initial stage involves a review of related studies to identify relevant problems and solutions. A suitable dataset that meets the research requirements is then located on Kaggle. The dataset undergoes preprocessing before being used to create a model. The PSO implementation is then incorporated into the prediction model, and the outcomes are assessed.

Dataset collection
The dataset used for this analysis was obtained from Kaggle in 2021 and is called "Age, Gender and Ethnicity (Face Data) CSV" by Nipun Aroro. The data is in .csv format and contains 23,479 images of various faces with ages ranging from 1 to 116. The dataset consists of five columns: "age", "ethnicity", "gender", "img_name", and "pixels". Each row in the dataset represents an individual image, and the columns provide the age, ethnicity, and gender of the person in the image, as well as the image's filename and pixel values. Figure 1 illustrates that the ages in the dataset range from 0 to 116 years, with the largest amount of data concentrated in the 20-40 year range. This distribution results in an imbalance in the dataset, as there are relatively fewer samples available for the younger and older age ranges. This imbalance may impact the performance of any model trained on this dataset and must be taken into account during analysis.

Data pre-processing
The collection of facial images used in this study was originally composed of pixel values.The data was preprocessed by normalizing and downsizing the photos to a uniform size of 48x48 pixels before model training.
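The preprocessing described above can be sketched as follows. The column names "age" and "pixels" follow the dataset description; the 80/20 train/test ratio and the fixed random seed are assumptions, as the paper does not state them.

```python
import numpy as np
import pandas as pd

def preprocess(df, size=48, test_frac=0.2, seed=42):
    """Parse the space-separated 'pixels' strings into (N, size, size, 1)
    arrays scaled to [0, 1], then split into train and test sets."""
    X = np.stack([
        np.array(p.split(), dtype=np.float32).reshape(size, size, 1)
        for p in df["pixels"]
    ]) / 255.0                                    # normalize pixel values
    y = df["age"].to_numpy(dtype=np.float32)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # shuffle before splitting
    cut = int(len(X) * (1 - test_frac))
    train, test = idx[:cut], idx[cut:]
    return X[train], X[test], y[train], y[test]

# Tiny synthetic stand-in for the Kaggle CSV (the real file has 23,479 rows).
demo = pd.DataFrame({
    "age": [25.0, 40.0, 3.0, 67.0, 18.0],
    "pixels": [" ".join(["128"] * (48 * 48))] * 5,
})
X_train, X_test, y_train, y_test = preprocess(demo)
print(X_train.shape, X_test.shape)  # (4, 48, 48, 1) (1, 48, 48, 1)
```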
The dataset was subsequently split into distinct "train" and "test" sets for the purpose of validating and improving the model. To evaluate how well the trained model performed, Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) were chosen as the assessment metrics, in addition to Mean Squared Error (MSE) as the loss function. The execution time was also noted to assess the model's efficiency. These criteria were chosen to give a thorough understanding of the model's efficacy and accuracy in estimating people's ages based on their facial images. A visual representation of the facial image dataset utilized in this research is shown in Figure 2. The collection includes facial photographs of people of various ages and ethnic origins. This visual representation gives an overview of the varied facial characteristics that the model must take into account in order to forecast age effectively. It also draws attention to the dataset's diversity in lighting, pose, and facial expressions, which could affect how well the model performs. All things considered, the image format offers a more intuitive comprehension of the dataset, enabling a more informed analysis of the model's performance.

Convolutional Neural Network (CNN) model
Three main types of layers make up a CNN: fully-connected (FC), pooling, and convolutional layers. In the convolutional layer, the input image is convolved with a group of learnable filters, or kernels, that extract features from it. An activation function, such as the rectified linear unit (ReLU), is then applied to introduce nonlinearity into the output of the convolutional layer.
The output of the convolutional layer is downsampled in the pooling layer in order to condense the feature maps and keep only the most crucial features. Max pooling, a popular pooling technique used in CNNs, chooses the highest value from a collection of nearby neurons in the feature map.
Then, the output is flattened and fed into fully connected layers, which in this study perform regression on the extracted features. This study uses a custom output head consisting of a flatten layer, a dense layer with 128 neurons, and another dense layer with a single neuron for regression. This custom head is added to each CNN model that is tested.
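A minimal Keras sketch of this custom head attached to a pre-trained backbone. The ReLU activation on the 128-unit dense layer and the Adam optimizer are assumptions (the paper does not state them); the loss, metrics, and initial learning rate follow the experiments section.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_age_regressor(backbone_name="VGG19", input_shape=(48, 48, 3),
                        weights="imagenet"):
    """Attach the custom regression head (Flatten -> Dense(128) -> Dense(1))
    to a frozen pre-trained feature extractor for the transfer-learning phase."""
    backbone_cls = getattr(tf.keras.applications, backbone_name)
    backbone = backbone_cls(include_top=False, weights=weights,
                            input_shape=input_shape)
    backbone.trainable = False                 # frozen during transfer learning
    model = models.Sequential([
        backbone,
        layers.Flatten(),
        layers.Dense(128, activation="relu"),  # activation is an assumption
        layers.Dense(1),                       # single unit for age regression
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="mse", metrics=["mae", "mape"])
    return model
```

The same function covers all four tested architectures by changing `backbone_name` to "VGG16", "ResNet50V2", or "Xception".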
The pre-trained convolutional neural network (CNN) architectures VGG16, VGG19, ResNet50, and Xception are frequently used for image classification applications because they have shown great accuracy and robustness on a range of image recognition benchmarks. These models have deep architectures with numerous layers that can efficiently capture complicated information from the input images and learn high-level representations of them. They have also been trained on sizable datasets such as ImageNet or VGGFaces [3]. As opposed to training a CNN from scratch, using these pre-trained models as a starting point and then optimizing them for a particular job can save a lot of time and computational resources.

Fine tuning
Fine-tuning is a well-known machine learning technique that allows a model that has already been trained to adapt to a new task by updating its learned parameters on fresh input. During the fine-tuning phase, a few layers of the pre-trained model are often unfrozen and made trainable for the new task, while the remaining layers are kept frozen to preserve the previously learned generic features. This strategy aims to make use of the pre-trained model's learned features while also enabling customization for the current task. This study, on the other hand, took a more aggressive approach to fine-tuning by unfreezing the entire pre-trained model and making it trainable for the specific task. Fine-tuning has been shown to be a highly successful technique for improving the performance of pre-trained models on new datasets for specific tasks. Many studies have demonstrated that fine-tuning can significantly increase a model's accuracy and generalization ability, particularly when the pre-trained model has learned features similar to the current task [11]. As a result, optimizing the entire pre-trained model for the specific task may further improve the model's performance on facial age estimation.
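The aggressive fine-tuning step described above can be sketched as follows. The reduced learning rate is taken from the experiments section; the Adam optimizer is an assumption.

```python
import tensorflow as tf

def unfreeze_for_finetuning(model, learning_rate=1e-5):
    """Make every layer trainable again (the whole backbone is unfrozen,
    not just the top layers) and recompile with a reduced learning rate."""
    model.trainable = True
    for layer in model.layers:
        layer.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="mse", metrics=["mae", "mape"])
    return model
```

Recompiling is required after changing `trainable` flags, otherwise Keras keeps using the old training configuration.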

Particle Swarm Optimization Implementation
The entire procedure of the suggested method is depicted in Figure 4. The dataset is split and processed so that the model can be trained and tested on it. With the exception of the output layer, the layers in the models are frozen, i.e., set to untrainable, for transfer learning. The model is evaluated on the test data after the transfer learning process is complete. The models are then unfrozen, i.e., set to trainable, for fine-tuning. After fine-tuning, the PSO method optimizes the hyperparameters of the models.
Particle Swarm Optimization optimizes the hyperparameters of the output head that is added to the CNN models, namely the neuron units of the dense layer, as well as the epochs and learning rate for training. The PSO algorithm searches for the combination of these hyperparameters that yields the best performance. PSO is applied after fine-tuning: the trained CNN models' hyperparameters seed the initial particles. The PSO then evaluates the fitness value iteratively until the best fitness value is found; in the end, the particle with the best fitness value produces the prediction model with the highest accuracy. The Particle Swarm Optimization algorithm is depicted in a flowchart in Figure 4. In typical PSO, after the population is initialized with random positions and velocity vectors, the fitness function is evaluated to determine the performance of each particle. The algorithm then finds and updates the local and global best particles in order to discover the optimal hyperparameters. Each time a new iteration begins, the positions and velocities of the particles are updated [12]. At the end of each iteration, the performance of all particles is evaluated by a predetermined cost function, for which we employed MAE.
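A minimal numpy sketch of the PSO loop just described. In the paper, each particle encodes (neuron units, epochs, learning rate) and the fitness is the validation MAE of a model retrained with those hyperparameters; here any callable works, and a toy quadratic stands in for the expensive model-training fitness. The inertia and acceleration coefficients are common defaults, not values from the paper.

```python
import numpy as np

def pso(fitness, bounds, n_particles=10, n_iters=30,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over a box-bounded search space with standard PSO."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    pos = rng.uniform(lo, hi, (n_particles, len(bounds)))  # random positions
    vel = np.zeros_like(pos)                               # zero velocities
    pbest = pos.copy()                                     # per-particle best
    pbest_val = np.array([fitness(p) for p in pos])
    g = pbest[pbest_val.argmin()].copy()                   # global best
    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # velocity update: inertia + cognitive (pbest) + social (gbest) pulls
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([fitness(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()

# Toy stand-in fitness: squared distance to a known optimum at
# (128 neurons, 30 epochs, lr = 10^-4); real fitness = validation MAE.
best, val = pso(lambda p: float(np.sum((p - [128, 30, -4]) ** 2)),
                bounds=[(16, 256), (5, 50), (-5, -3)])
```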

Mean Absolute Error (MAE)
A frequently used metric for assessing the precision of predictive models is mean absolute error (MAE).
It calculates the average absolute difference between the predicted and actual values. Unlike other error metrics such as Root Mean Square Error (RMSE), MAE does not severely punish large errors, which makes it a more reliable metric when handling outliers in the data. MAE is obtained by summing the absolute differences between the predicted and observed values across all observations. The accuracy of the model improves as the MAE value decreases [13].

MAE = (1/n) * Σ |y_i − ŷ_i|

The MAE computation is rather straightforward: take the absolute value of the difference between the predicted value and the actual value for each observation, then divide the sum of all these absolute differences by the total number of predictions. This gives the model's average error.

Mean Squared Error (MSE)
Mean squared error (MSE) is a common measure for evaluating how well prediction models perform, typically used when the purpose of a regression model is to predict a continuous variable. It measures the average of the squared differences between the predicted and actual values. Since MSE gives large errors more weight, it is a more sensitive metric than Mean Absolute Error (MAE); as with MAE, the model's accuracy rises as the MSE drops.

MSE = (1/n) * Σ (y_i − ŷ_i)²

To compute MSE, we subtract each data point's true value from its predicted value, square these differences to make them all positive and to give larger differences more significance, then sum the squared differences and divide this sum by the total number of predictions. The result is the mean squared error between the predicted and actual values.

Mean Absolute Percentage Error (MAPE)
Regression model quality is also measured by the Mean Absolute Percentage Error (MAPE), which is frequently employed in forecasting applications where it is known that the quantity to be predicted is always well above zero. It is calculated by averaging the absolute difference between the actual and predicted values, expressed as a percentage of the actual value. The common ground between these three metrics is that lower is better [14].

MAPE = (100/n) * Σ |y_i − ŷ_i| / |y_i|

The MAPE calculation is similar to MAE, but instead of just taking the absolute difference, we express that difference as a percentage: dividing the absolute difference by the actual value and multiplying by 100 gives the percentage error. As with MAE, we then sum these values and divide by the total number of predictions.
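The three metrics can be computed directly in numpy, as in this short sketch (`y` holds actual ages, `yhat` predicted ages):

```python
import numpy as np

def mae(y, yhat):
    # mean absolute difference between actual and predicted values
    return float(np.mean(np.abs(y - yhat)))

def mse(y, yhat):
    # mean squared difference; penalizes large errors more heavily
    return float(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    # mean absolute error as a percentage of the actual value
    return float(np.mean(np.abs(y - yhat) / np.abs(y)) * 100)

y    = np.array([20.0, 35.0, 50.0])
yhat = np.array([25.0, 30.0, 45.0])
# mae -> 5.0, mse -> 25.0, mape ≈ 16.43
```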

Results and discussions
In order to avoid overfitting, we trained the model for 50 epochs with early stopping during transfer learning, while freezing the feature extractor layers of the model. We used a batch size of 64 with a learning rate of 0.0001. All the layers were then unfrozen during the fine-tuning process, with the same hyperparameters except the learning rate, which was decreased to 0.00001.
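The two training phases above can be summarized in a small configuration sketch (the values are taken from the text; the dictionary names are illustrative):

```python
# Hyperparameters for the two training phases, as reported in the text.
TRANSFER_LEARNING = {
    "epochs": 50,                  # with early stopping to avoid overfitting
    "batch_size": 64,
    "learning_rate": 1e-4,
    "backbone_trainable": False,   # feature extractor layers frozen
}
FINE_TUNING = {
    **TRANSFER_LEARNING,
    "learning_rate": 1e-5,         # reduced tenfold for fine-tuning
    "backbone_trainable": True,    # all layers unfrozen
}
```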
Figure 5 shows the best results of transfer learning and fine-tuning, obtained with the VGG19 model. Fine-tuning a model resulted in a significant increase in performance; in this case the loss of the VGG19 model dropped from 232.671 to 88.688, although the graph indicates a case of overfitting, as the training loss is much lower than the validation loss. Based on the results presented in Table 1, it is clear that fine-tuning a model can lead to some improvements in its performance. The best performing model before fine-tuning was ResNet50 V2 with a loss of 184.074, followed by Xception with 191.325, VGG16 with 204.307, and VGG19 with 232.671. However, after fine-tuning, the VGG19 model achieved the best performance with a loss of 88.688, followed by VGG16 with 90.465, ResNet50 V2 with 122.494, and Xception with 128.768. It is important to note that the execution time for fine-tuning may be longer due to early stopping. With our proposed approach, the PSO process was able to find another combination of hyperparameters, specifically the neuron units, epochs, and learning rate. Although the performance increase achieved by PSO was not significant, it found sets of hyperparameters that were on par with or even better than the initial hyperparameters. PSO suffers from early convergence in the initial stages and consumes a lot of time; it is more suited to smaller datasets than larger ones, since the latter require a long time and a large search space [15].
The results of this study indicate that incorporating PSO to optimize the hyperparameters of the CNN models for age detection led to some improvement in the accuracy of age estimation. Although the improvement obtained was not statistically significant, there was a discernible increase in the model's performance. Using PSO to fine-tune the custom output layer in regression allowed for better handling of the substantial variation in facial appearance within age groups, resulting in enhanced age detection accuracy. While the improvement may not be considered significant in a statistical sense, it does demonstrate the potential of using PSO for optimizing hyperparameters in the context of age estimation.
Future work may experiment with different output models and hyperparameter settings that may lead to better performance. It is important to remember that simply making the output layer more complex does not always improve performance. The architecture and hyperparameters of the complete neural network, including the input and hidden layers, as well as the regularization strategies applied to avoid overfitting, must also be carefully considered. To further enhance the quality of the input data, it may be advantageous to experiment with various data preparation, feature selection, or data augmentation strategies.

Conclusion
Overall, these results demonstrate the importance of fine-tuning and exploring different hyperparameters to improve the performance of machine learning models. Fine-tuning can significantly improve model performance, and techniques such as PSO can be used to further optimize hyperparameters and achieve even better results. These results demonstrate the effectiveness of the PSO process in finding optimized hyperparameters that can improve the performance of machine learning models. VGG19 and VGG16 achieved better results after the PSO process, while ResNet50 V2 and Xception did not see as much improvement. It is important to note that while the PSO process was able to find improved hyperparameters, its execution time may be longer than simply fine-tuning the models. Therefore, the trade-off between performance improvement and execution time must be carefully considered when selecting an optimization method.

Fig. 3. The structure of a general convolutional neural network (CNN) used in this study.

Fig. 5. The workflow of the program and the flowchart of the Particle Swarm Optimization (PSO).

Figures 6 to 9 show the loss graphs of training with the best hyperparameters found during the PSO process. All of the models are able to find a convergence point at a certain epoch, although the graphs still show a case of overfitting, as the training curves are still going down.

Table 1. Comparative results from experiments (after the PSO process).

Model        Loss      MAE    MAPE (%)
VGG19        86.181    6.693  38.462
VGG16        89.579    6.831  37.195
ResNet50 V2  120.874   7.965  50.975
Xception     124.91    8.137  52.069

Based on the results of the PSO process, the best performing model was VGG19, followed by VGG16 and then ResNet50 V2, with Xception the least performing of the four.