Sustainable Facial Authentication and Expression Prediction using Deep Learning Techniques

Abstract: To make machines more human-like, computer vision is essential. Computer vision is a field that focuses on mimicking the capabilities of the human visual system to extract high-level understanding from images or video. One such application gives a machine the ability to recognize a person and detect their emotions so that it can respond appropriately. Facial verification is the process of identifying or validating a person through an image, video, or any audio-visual rendering of their face. It is a biometric identification strategy that distinguishes individuals directly through their facial structure and biometric data. The technology collects a unique set of biometric measurements of each individual's face and facial expressions in order to identify them. Facial emotion recognition is a technology used to analyse emotions from a variety of sources, such as images and video. It helps machines interpret people the way humans do and respond to them according to their emotions. We use a deep learning algorithm, the Convolutional Neural Network (CNN), to train a model that determines the sentiment of a given input image. The images must be pre-processed before training and testing the model; for pre-processing, we perform image enhancement combining resizing, histogram equalization, and grayscale conversion. This model has applications in both surveillance and feedback systems.


Literature Survey:
Adversarial examples can deceive machine learning models by applying optimized perturbations to the input that induce an incorrect classification [1,2]. In computer vision, these examples are typically created by making small perturbations to an original image, and current algorithms for constructing adversarial examples generally require access to the model's architecture and parameters for gradient-based optimization [2,3,4,5,6]. However, these methods are not applicable to humans, since no comparable access to the brain exists. Although adversarial examples may transfer from one model to another, enabling attacks on models inaccessible to an attacker, it is unclear whether humans are susceptible to such manipulations. While humans suffer from cognitive biases and optical illusions, these do not resemble the image perturbations produced by loss-function optimization [7,8]. Hence, it is currently hypothesized that transferable adversarial examples do not affect human visual perception, but this remains empirically untested.
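To make the gradient-based construction concrete, the following is a minimal sketch of the fast gradient sign method (FGSM) in TensorFlow; the model, label encoding, and epsilon value are illustrative assumptions rather than details taken from the cited works.

    import tensorflow as tf

    def fgsm_perturb(model, image, label, epsilon=0.01):
        # One-step gradient-based perturbation: nudge each pixel in the
        # direction that increases the classification loss.
        image = tf.convert_to_tensor(image, dtype=tf.float32)
        with tf.GradientTape() as tape:
            tape.watch(image)
            prediction = model(image)
            loss = tf.keras.losses.sparse_categorical_crossentropy(label, prediction)
        gradient = tape.gradient(loss, image)
        adversarial = image + epsilon * tf.sign(gradient)
        return tf.clip_by_value(adversarial, 0.0, 1.0)  # keep a valid pixel range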
A thorough exploration of this question presents an opportunity for reciprocal learning between machine learning and neuroscience. Machine learning has often drawn inspiration from neuroscience, object recognition algorithms being one example: prior to their development, it was postulated that constructing such algorithms would be feasible because the human brain can recognize objects, as reviewed by Hassabis et al. [9] in the context of the relationship between artificial intelligence and neuroscience. Verification that the human brain can resist a specific category of adversarial examples would serve as evidence that machine learning security could adopt a comparable strategy. Conversely, if it were established that the human brain can be deceived by adversarial examples, machine learning security should re-evaluate its current focus on developing models that resist such examples [2,3,5,10,11,12,13,14]; instead, secure systems that incorporate non-robust machine learning elements should be prioritized. Consequently, if adverse effects of adversarial examples on the brain were discovered in the context of computer vision, such a machine learning finding could also enhance our understanding of brain function.

Existing systems:
Studying facial emotions can aid in investigating human behaviour. Identifying facial emotions involves measuring the eyes, nose, and mouth, as well as their positions. Distance measures were initially used to estimate the strength of facial emotions, categorizing and counting them using high-dimensional 3D transformations and regional volume difference maps. Most facial expression systems represent these elements using PCA on image sequences. PCA can be used to identify the action unit responsible for eliciting a facial expression, which can further be used to recognize additional facial expressions. Researchers have retrieved the face region using the active contour model, minimizing variation within the face while maximizing the separation between the face and its background via the Chan-Vese and Bhattacharyya energy functions. Optical flow allows for the recovery of geometric aspects of facial emotions and movement features, while wavelet decomposition reduces noise. Unlike DL methods, traditional ML approaches do not require significant computational and memory resources, which has led to embedded devices that can classify in real time with minimal resources and produce acceptable results [21][22][23][24][25][26][27][28][29][30][31][32][33][34].
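As a hedged illustration of the PCA-based representation mentioned above, the sketch below projects flattened face images onto a few principal components (eigenfaces); the 48x48 image size, component count, and random stand-in data are assumptions for illustration only.

    import numpy as np
    from sklearn.decomposition import PCA

    # faces: N grayscale face images flattened to vectors (random stand-ins here)
    faces = np.random.rand(100, 48 * 48)

    pca = PCA(n_components=20)           # keep the 20 strongest variation modes
    features = pca.fit_transform(faces)  # low-dimensional expression features
    eigenfaces = pca.components_.reshape(-1, 48, 48)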

Proposed System
The system proposed in this article employs deep learning algorithms, specifically Convolutional Neural Networks (CNNs), which have greatly impacted the field of computer vision in the past decade. Using these approaches, tasks such as feature extraction, recognition, and classification can be performed more efficiently. CNNs in particular eliminate the need for physics-based models and require less preprocessing work. By enabling direct learning from input images, DL-based approaches have achieved promising results in areas such as Facial Emotion Recognition (FER), scene analysis, and face and object recognition. A typical DL-CNN architecture consists of three layer types: a convolution layer, a subsampling layer, and a Fully Connected (FC) layer. The convolutional layer takes input images or feature maps and applies filters to create spatially organized feature maps; the filters share weights across spatial positions, while each feature map unit is locally connected to its inputs. In the subsampling layer, feature maps are condensed using techniques such as max pooling, min pooling, or average pooling. The final FC layer of a CNN determines the class probabilities for a complete input image. Built on a CNN, most DL-based emotion recognition approaches can be adapted to different domains. Face recognition remains a significant challenge because recognition errors are common; it can be categorized as real-time recognition or photo recognition, and this project uses visual invariants to identify faces in still photographs. The grayscale distribution of a typical human face is studied to determine the significant variations in intensity. An "average human face" is created from numerous human faces, and properly scaled colour maps highlight grayscale intensity changes. The grey-level imbalance is noticeable across all sample locations: the cheeks, forehead, and nose appear grey with high light intensity, while the eye-eyebrow region consistently exhibits low-intensity grey levels.
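The following Keras sketch mirrors the three layer types just described for a seven-class emotion problem; the filter counts and layer sizes are illustrative assumptions, not the exact configuration used in this work.

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),               # 48x48 grayscale face
        layers.Conv2D(32, (3, 3), activation="relu"),  # convolution layer
        layers.MaxPooling2D((2, 2)),                   # subsampling layer
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),          # fully connected layer
        layers.Dense(7, activation="softmax"),         # seven emotion classes
    ])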

Convolutional Neural Networks
The activation layer introduces non-linearity, enabling the network not only to recognize simple features but also to combine them to identify more complex ones. The pooling layer, on the other hand, reduces the spatial dimensions of the output, aiding computational efficiency and preventing overfitting. Training a convolutional neural network involves minimizing a loss function that measures the difference between predicted and actual output. Optimizers such as gradient descent adjust the parameters of the network to minimize this error. This process is carried out in batches, allowing efficient processing of the large amounts of data these networks typically use. Convolutional neural networks have proven useful in a broad range of applications, including image and speech recognition, natural language processing, and even healthcare. Their ability to learn hierarchical features and process large datasets makes them an invaluable tool in the field of artificial intelligence. The Convolutional Neural Network (CNN) offers design insights and abstraction capabilities superior to systems employing only fully connected layers [25]. CNNs are intuitive and biologically motivated, allowing them to effectively identify objects and learn abstract characteristics [8]. Weight sharing is a critical concept employed by CNNs that reduces the number of parameters to be trained, resulting in smoother optimization and less overfitting [16,32]. Compared to fully connected artificial neural network (ANN) models, CNNs are easier to train at larger scales [36] and have been widely employed in various fields due to their impressive performance, including image classification [5,12], object localization [34], head pose estimation [34], speech recognition, facial recognition [23], vehicle recognition [30], diabetic retinopathy [29], and more. The purpose of this study is to train such a system and to improve understanding of CNNs, discussing basic concepts, three popular models, learning algorithms, and newer techniques such as the Adam optimizer [11].
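Continuing the architecture sketch above, a hedged example of compiling and training with the Adam optimizer follows; the batch size, epoch count, and the x_train/y_train arrays are assumptions for illustration.

    # x_train: (N, 48, 48, 1) float images; y_train: integer emotion codes 0-6
    model.compile(
        optimizer="adam",                        # adaptive gradient-descent variant
        loss="sparse_categorical_crossentropy",  # integer-labelled classes
        metrics=["accuracy"],
    )
    history = model.fit(x_train, y_train, batch_size=64, epochs=30,
                        validation_split=0.1)    # mini-batch training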

Dataset
We used two different datasets to train the model. The first is FER-2013 (fer2013.csv), an open-source facial expression recognition dataset. It includes grayscale images of faces with dimensions of 48x48 pixels, with the faces appearing in a similar position across all images. The objective is to categorize each face into one of seven emotions: anger, disgust, fear, joy, sadness, surprise, and neutrality. The "train.csv" file contains two columns, "Emotions" and "Pixels": the former consists of codes ranging from 0 to 6 corresponding to each emotion, while the latter contains a string of pixel values for each image. The "test.csv" file contains a similar "Pixels" column, and the goal is to predict the emotion for each face. The training set consists of 28,709 images, while the test set used for evaluation has 3,589 images. This dataset was developed by Pierre-Luc Carrier and Aaron Courville.
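A hedged sketch of parsing this CSV format follows; the lowercase column names ("emotion", "pixels"), as found in the public Kaggle release, and the file path are assumptions that may differ from the exact files used here.

    import numpy as np
    import pandas as pd

    def load_fer2013(csv_path="fer2013.csv"):
        data = pd.read_csv(csv_path)
        # Each "pixels" entry is a space-separated string of 2304 values (48*48)
        faces = np.stack([
            np.fromiter(map(int, s.split()), dtype=np.uint8).reshape(48, 48)
            for s in data["pixels"]
        ])
        labels = data["emotion"].to_numpy()  # integer emotion codes 0-6
        return faces, labels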
The second dataset, used to improve model performance, is a custom dataset obtained by web scraping Google for images of the seven emotions. We used Python with the Beautiful Soup library to scrape the search results, importing BeautifulSoup from bs4 to extract the images from the page source. We saved all the scraped data to our system and segregated it by emotion for pre-processing, collecting a couple of thousand pictures for every emotion.
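A minimal, hedged sketch of such scraping with requests and BeautifulSoup is shown below; the URL pattern and parsing are purely illustrative, since live Google results are loaded dynamically and a production scraper would typically need an API or browser automation.

    import os
    import requests
    from bs4 import BeautifulSoup

    def scrape_images(query, out_dir, limit=50):
        # Fetch an image-search results page and save directly linked images.
        os.makedirs(out_dir, exist_ok=True)
        url = f"https://www.google.com/search?q={query}&tbm=isch"
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        saved = 0
        for img in soup.find_all("img"):
            src = img.get("src", "")
            if not src.startswith("http") or saved >= limit:
                continue
            path = os.path.join(out_dir, f"{query}_{saved}.jpg")
            with open(path, "wb") as f:
                f.write(requests.get(src, timeout=10).content)
            saved += 1

    scrape_images("happy face", "data/joy")  # one call per emotion class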

Pre-processing Techniques
OpenCV is an open-source computer vision library with Python bindings. It is widely used because it provides modules and functions that make it easy to perform image processing tasks with just a few lines of code. This section covers the image processing techniques from OpenCV used in this work. The library offers over 2,500 optimized algorithms spanning classic and state-of-the-art computer vision and machine learning. With these algorithms, users can recognize and identify faces, objects, and human behaviour in videos, monitor camera movements, track moving objects, create 3D models of objects, produce 3D point clouds using stereo cameras, merge images to create panoramic views of scenes, find similar images in collections, remove red-eye from flash photographs, track eye movements, and overlay augmented reality markers onto scenes. OpenCV has a user community of more than 47,000 individuals and has been downloaded more than 18 million times; it is widely used by companies, academic research institutions, and government bodies.

Thresholding is a technique that manipulates pixel values in an image by setting them to either zero or a maximum value. Various thresholding methods are available, applicable to either grayscale or colour images. One of the most common is binary thresholding, which sets pixel values to zero or 255 based on a given threshold: pixels below the threshold are set to zero, while pixels above it are set to 255. In a grayscale image this yields a black-and-white image, while colour images will contain only black, blue, green, red, or combinations of these, such as white (B+G+R) or yellow (R+G). Inverse binary thresholding is the opposite, setting pixels above the threshold to zero and pixels below it to 255; this also produces a black-and-white image for grayscale inputs. Truncated thresholding leaves values below the threshold unchanged and sets values above the threshold to the threshold value. Threshold-to-zero sets pixel values below the threshold to zero and leaves values above it unchanged. Threshold-to-zero-inverse sets pixel values above the threshold to zero and leaves values below it unchanged.
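These five variants correspond directly to OpenCV's threshold flags; a minimal sketch, assuming a grayscale input and an illustrative threshold of 127:

    import cv2

    img = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)  # illustrative path

    # cv2.threshold returns (threshold_used, output_image)
    _, binary      = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)      # >127 -> 255
    _, binary_inv  = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)  # >127 -> 0
    _, truncated   = cv2.threshold(img, 127, 255, cv2.THRESH_TRUNC)       # >127 -> 127
    _, to_zero     = cv2.threshold(img, 127, 255, cv2.THRESH_TOZERO)      # <127 -> 0
    _, to_zero_inv = cv2.threshold(img, 127, 255, cv2.THRESH_TOZERO_INV)  # >127 -> 0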
During image processing, grayscale images are usually preferred over RGB images because they carry less data and allow complex operations to run faster. In OpenCV, the cvtColor function converts an image from one colour space to another given a colour conversion code. To convert an RGB image to grayscale, the COLOR_BGR2GRAY code is used, since OpenCV loads images with the channel order Blue, Green, Red (BGR) rather than RGB. OpenCV uses a weighted method for RGB-to-grayscale conversion, also known as the luminosity method, whereby the grayscale value is the weighted sum of the R, G, and B colour components: Gray = 0.299·R + 0.587·G + 0.114·B.
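A short sketch of both the built-in conversion and the equivalent manual weighted sum; the file path is an illustrative assumption:

    import cv2
    import numpy as np

    bgr = cv2.imread("face.jpg")                  # OpenCV loads channels as B, G, R
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)  # luminosity-weighted conversion

    # Equivalent manual computation of Gray = 0.299*R + 0.587*G + 0.114*B
    b, g, r = bgr[..., 0], bgr[..., 1], bgr[..., 2]
    manual = (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)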

Results Analysis
To be successful, a model requires both proper coding and testing. In our case, the code receives input from the device's web camera and produces an output based on the captured emotion. Various user images with different emotive expressions are used to verify the accuracy of the model, which also displays a confidence level alongside the user's name and emotion. We evaluate performance through an accuracy score that measures the precision of the results, and a confusion matrix is used to determine how much of the sentiment data was predicted correctly. Upon inputting an image into the system, the results are instantaneous, and the various types of testing begin. Integration testing is conducted to ensure that the software requirements of our system, such as RAM, OS, and libraries, can process the code without errors during compilation. Once the code runs without errors, the input is displayed with the output, validating the model. The final results on the output screen are shown in Fig. 1. We opted for the confusion matrix as our metric because, with seven possible output labels, it gives a complete overview of model performance for every emotion separately. Fig. 2 shows the confusion matrix.
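A hedged sketch of computing these two metrics with scikit-learn; the label arrays below are illustrative placeholders, and in practice they would come from the test set and the trained model:

    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix

    emotions = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]
    y_true = np.array([0, 3, 3, 6, 2])  # placeholder ground-truth emotion codes
    y_pred = np.array([0, 3, 4, 6, 2])  # placeholder model predictions

    print("accuracy:", accuracy_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred, labels=range(7)))  # 7x7 per-emotion view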

Conclusion
Facial authentication and expression prediction are technological advancements that have revolutionized the approach to security and surveillance systems. These technologies have been integrated into biometrics and information security systems to facilitate identity authentication, secure sensitive data, and restrict unauthorized access to specific areas. Law enforcement agencies have also adopted these systems to identify suspects, locate missing persons, and track criminal activities. They have also become popular in smart cards and payment systems, enhancing transaction security and convenience. Facial expression prediction technology has had significant applications in various fields, including human behaviour understanding, mental health, and synthetic human expressions. With the improvements in artificial intelligence and machine learning, it is now possible to employ these technologies to predict human emotions based on facial expressions, voice tone, and body language. This technology can have significant applications in advertising, where the emotional response of a target audience can be analyzed to create effective campaigns. It can also be used for detecting mental disorders, thus improving the diagnosis and treatment of mental health conditions. Facial authentication and expression prediction technologies have also found their way into customer feedback systems, particularly in public places like restaurants. These systems can provide feedback on the food and service without asking customers orally, making the process more convenient and less intrusive. Facial expression prediction technology can also enhance the customer experience, providing insight into customers' emotional responses to the food and helping restaurants improve their menus. Facial authentication and expression prediction technology has become increasingly popular, with significant adoption expected in the coming years. As a result, these technologies are poised to become more accessible and integrated into our daily lives, transforming the way we interact with machines and each other.

Future Enhancements
Our model can be further developed to suggest activities to the user based on their emotion. For example, if an authenticated user is sad, they can be redirected to podcasts that may help them calm down; similarly, if the user is happy, they can be prompted to share that happiness with their peers. Once the user is identified, the model could also alert other users to the current user's emotion. The model's accuracy can be increased further using stacked models, transfer learning, and fine-tuning on the data. Finally, the model can be deployed on the web to make it easy for users with no prior programming knowledge to use.

Fig. 1. The output screen with the user name and emotion with confidence percentages.

Fig. 2. Confusion Matrix.