Hand Gesture Recognition and Voice Conversion for Deaf and Dumb

In this paper, we propose a hand gesture recognition model that can be used in real-time applications. The model is based on Google's MediaPipe framework, TensorFlow, OpenCV, and Python, with classification performed by a feed-forward neural network built with Keras. The proposed work consists of three modules: grabbing the frames, detecting hand landmarks, and classification. The proposed model achieves an accuracy of 95.7% in recognizing 10 kinds of hand gestures (Thumbs up, Thumbs down, Peace, Smile, Rock, Ok, Fist, Live long, Call me, Stop). The primary achievements of this work are a hand gesture recognition model that reacts rapidly with generally acceptable accuracy, and the use of a pre-trained model for feature extraction. The distinguishing feature of the proposed approach is that it detects hand landmarks using Google’s MediaPipe, which is faster and more accurate than traditional methods that rely on geometry, form, and edge data. For modelling sequence data and for recognising gestures, the LSTM model has proven to be quite successful.


INTRODUCTION
These days, many tasks no longer require complicated methods because technology has automated most of them. Disabled people, however, benefit little from this automated environment, and deaf and dumb people in particular remain underserved because they find it difficult to interact with others. One of the key reasons is that they communicate differently from most people, and technological development has not given persons with disabilities much attention. This is the main justification for selecting a project that could benefit them. The HGRSLTV program, which stands for "Hand Gesture Recognition of Sign Language for Text and Voice Conversion," enables deaf and dumb people to communicate with one another by observing and tracing the movements of their hands. Hand motion can be detected with a web camera. Gesture recognition is an active research topic in human-computer interaction. It has a variety of applications, including music production, robot control, sign language translation, and virtual environment control. In this project on hand gesture recognition and voice conversion, we use the MediaPipe framework and TensorFlow with OpenCV and Python to create a real-time hand gesture recognizer. OpenCV is a real-time computer vision and image processing library written in C/C++; the opencv-python package makes it available from Python.

Problem statement
Due to birth abnormalities, accidents, and oral infections, there has been a sharp rise in the number of people who are deaf and dumb in recent years. Deaf and dumb persons must rely on some form of visual communication since they cannot communicate verbally with others. Around the world, many different languages are spoken and translated. The term "Special Persons" refers to people who have difficulty hearing ("the deaf") or speaking ("the dumb"). They can have difficulty understanding what another person is attempting to say, and others sometimes misinterpret communication made through sign language, lip reading, or lip sync.

Objectives
Gesture recognition is an important subject in human-computer interaction research. It has a wide range of applications, including virtual environment control, sign language translation, robot control, and music composition. Our goal in this machine learning project is to create a real-time hand gesture recognizer using the MediaPipe framework and TensorFlow with OpenCV and Python.

LITERATURE SURVEY
Researchers have put forward numerous strategies, in the form of patents and research papers, for recognizing gestures and signals related to sign language. The authors of [4] present a novel deep learning-based method for 3D hand gesture detection. They propose a new Convolutional Neural Network (CNN) in which parallel convolutions are used to handle sequences of hand-skeletal joint locations [5], and they assess how well this model performs on classification tasks involving hand gesture sequences. Their model does not use any depth images; it uses hand-skeletal data only. Gurnani, Mavani, and Gajjar, V.
[14] Today's user interface methods, such as the mouse, keyboard, and touch pen, are insufficient given the widespread advancement of computing. With machine learning and computer vision, users can draw by using their hands or hand movements directly as an input device, and various shapes can easily be drawn in such a human-computer interaction programme.
A model can be developed using a pixels-first approach, but the disadvantage is that training a neural network to extract embeddings directly from pixels is a much harder task. In the proposed system we do not have to train such a network, because ready-made neural networks are available for this task, and this is where Google's MediaPipe comes in. The approach introduced in this project relies on a reliable neural network that is capable of extracting hand landmarks. Training a neural network on landmarks to recognize gestures is simpler than training it on pixels, because hands differ in shape, finger length, and skin color, and lighting, cameras, and backgrounds differ as well. Training a system on pixels would therefore require collecting a large number of training images. For instance, to train a peace sign directly on pixels, one would need images covering all sorts of different backgrounds and orientations. With hand landmarks the task is easier: once the landmarks are extracted, they are essentially all the data that a smaller neural network needs in order to perform classification. The pre-trained network does the work of extracting the data, so we can train a much smaller and simpler network for the actual classification, one that is lighter to train and requires far fewer training images. That is the strength of the landmark-extraction approach used here.
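To make the size argument concrete, the short sketch below compares how many input values a classifier would receive per frame under each approach. The 640x480 frame size and the use of only two coordinates per landmark are assumptions chosen for illustration; the paper does not state either.

```python
# Assumed values for illustration only: the paper does not state a frame size.
FRAME_WIDTH = 640
FRAME_HEIGHT = 480
COLOR_CHANNELS = 3

NUM_LANDMARKS = 21        # landmarks per hand reported by MediaPipe Hands
COORDS_PER_LANDMARK = 2   # assumption: only x and y are fed to the classifier


def pixel_input_size(width, height, channels):
    """Number of input values when the classifier sees the raw frame."""
    return width * height * channels


def landmark_input_size(num_landmarks, coords_per_landmark):
    """Number of input values when the classifier sees only landmarks."""
    return num_landmarks * coords_per_landmark


pixels = pixel_input_size(FRAME_WIDTH, FRAME_HEIGHT, COLOR_CHANNELS)
landmarks = landmark_input_size(NUM_LANDMARKS, COORDS_PER_LANDMARK)
print(pixels, landmarks)  # 921600 42
```

Under these assumptions the landmark representation is smaller by more than four orders of magnitude, which is why a very small classifier can suffice.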
MediaPipe Hands is a high-quality hand and finger tracking system. It estimates 21 3D hand landmarks from a single frame using machine learning (ML). Unlike other cutting-edge systems, which mostly rely on powerful desktop environments for inference, it provides real-time performance on a mobile phone and even scales to multiple hands. The steps of the landmarks-first approach used in this project are:
1. Grab the frames, in our case from a webcam, in such a way that we can interact and work with them. In Python, we do this using the OpenCV library.
2. Every frame or image is represented in the computer as an RGB matrix. The matrices obtained through OpenCV contain all the information in the image.
3. The RGB matrices are passed through a convolutional neural network that extracts the hand landmarks. For this purpose we use MediaPipe, which provides a ready-made, pre-trained neural network.
4. The detected landmarks are passed into a much smaller and simpler neural network with a feed-forward architecture, which gives the final classification showing which sign the hand is making.

One remaining problem is that the hand can appear at different places and distances: it may be a little closer to the camera or a little farther away. To handle this scaling problem, the absolute pixel values within the frame must be converted into relative values, because a hand near the camera and a hand far from the camera should yield roughly the same values. This is done through a process called normalization.
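The frame-grabbing and landmark-detection steps can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the opencv-python and mediapipe packages with the legacy `mp.solutions.hands` interface, and the function names, the camera index, and the 0.7 confidence threshold are hypothetical choices. Note that OpenCV delivers frames in BGR channel order, so the sketch converts them to RGB before detection.

```python
def to_pixel_coords(normalized_landmarks, frame_width, frame_height):
    """Convert MediaPipe's 0..1 landmark coordinates to absolute pixel values."""
    points = []
    for x, y in normalized_landmarks:
        # Clamp so that landmarks predicted slightly off-frame stay in bounds.
        px = min(max(int(x * frame_width), 0), frame_width - 1)
        py = min(max(int(y * frame_height), 0), frame_height - 1)
        points.append([px, py])
    return points


def capture_landmarks(camera_index=0):
    """Yield 21 [x, y] pixel points for every frame in which a hand is found."""
    # Imported here so the helper above can be used without these packages.
    import cv2
    import mediapipe as mp

    capture = cv2.VideoCapture(camera_index)
    try:
        # The 0.7 threshold is an assumed value, not taken from the paper.
        with mp.solutions.hands.Hands(max_num_hands=1,
                                      min_detection_confidence=0.7) as hands:
            while True:
                ok, frame = capture.read()  # grab one frame from the webcam
                if not ok:
                    break
                # OpenCV delivers BGR; MediaPipe expects RGB.
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                results = hands.process(rgb)  # run the landmark network
                if results.multi_hand_landmarks:
                    height, width = frame.shape[:2]
                    hand = results.multi_hand_landmarks[0]
                    normalized = [(lm.x, lm.y) for lm in hand.landmark]
                    yield to_pixel_coords(normalized, width, height)
    finally:
        capture.release()
```

The generator yields absolute pixel coordinates, which is the form that still depends on where the hand is and how far it is from the camera.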
In the normalization process, the absolute pixel values are scaled into the range -1 to 1. First, the list of absolute pixel coordinate values is passed through a landmark pre-processing function.

• The first step in this function converts the whole list to relative coordinates, by taking the base point at index 0 and subtracting it from all of the other points.
• In this way, each point is represented by its offset from the base point (in this case, the wrist point).
• Every point in the list is converted and stored in a one-dimensional list.
• After the relative coordinates are obtained, the maximum absolute value in the list is found, and every value is divided by this maximum to obtain the normalized list.
• The intuition is that the pre-processed landmark list holds values normalized between -1 and 1, determined by how far each point is from the base point.
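The normalization steps listed above can be sketched in a few lines of plain Python. The function name `pre_process_landmarks` and the handling of the all-zero case are assumptions made for this illustration; the input is taken to be a list of [x, y] pixel points with the wrist at index 0.

```python
from itertools import chain


def pre_process_landmarks(landmark_list):
    """Turn absolute [x, y] pixel points into a flat list scaled to -1..1."""
    # Step 1: make every point relative to the base point (index 0, the wrist).
    base_x, base_y = landmark_list[0]
    relative = [(x - base_x, y - base_y) for x, y in landmark_list]

    # Step 2: flatten the points into a one-dimensional list.
    flat = list(chain.from_iterable(relative))

    # Step 3: divide by the largest absolute value so results lie in -1..1.
    max_value = max(abs(value) for value in flat)
    if max_value == 0:
        # Assumed fallback: all points coincide, so there is nothing to scale.
        return [0.0 for _ in flat]
    return [value / max_value for value in flat]


# The same pose seen near the camera and far from it (half the size, shifted).
near_hand = [[100, 100], [120, 80], [140, 60]]
far_hand = [[300, 200], [310, 190], [320, 180]]
print(pre_process_landmarks(near_hand) == pre_process_landmarks(far_hand))  # True
```

The two example hands differ in position and scale but produce identical normalized lists, which is the property the classifier relies on.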

Training the model
In this paper, we recorded our own data and created a small dataset of around 200 images for each sign. First, each sign name is assigned an index value from 0 to 9.
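One plausible way to store such a dataset is one CSV row per recorded sample, with the class index in the first column followed by the normalized landmark values. The sketch below assumes this layout; the file format and the function names `append_sample` and `load_dataset` are hypothetical and not taken from the paper.

```python
import csv


def append_sample(csv_path, label, features):
    """Append one labelled sample (class index + feature values) as a CSV row."""
    with open(csv_path, "a", newline="") as handle:
        csv.writer(handle).writerow([label, *features])


def load_dataset(csv_path):
    """Return (features, labels) read back from the CSV file."""
    features, labels = [], []
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            labels.append(int(row[0]))
            features.append([float(value) for value in row[1:]])
    return features, labels
```

With this layout, the number of classes can be determined when loading the data as `len(set(labels))`.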
In the OpenCV window, pressing the "k" key switches the display to the mode "Logging key point". In this mode, pressing a number key (0 to 9) adds the current sign to the dataset under the class with that index. Every sign in the dataset is assigned an index. To record samples for a sign, press the number key assigned to it repeatedly while varying the position and orientation of the hand. Each key press works like taking a snapshot of the image: it saves the key points under the label number. The next steps are loading the data, determining the number of classes in the dataset, and defining the model using Keras. Training this model is very quick, although it looks as if it is going to run for 1000 epochs. The whole neural network architecture contains 20 neurons followed by 10 neurons in this case, which is a really small architecture. More neurons can be added to the feed-forward network if more classes are required.
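A Keras model of the size described here might be defined as in the sketch below. Several details are assumptions not stated in the paper: an input of 42 values (21 landmarks with x and y), ReLU activations, a separate 10-way softmax output layer, the Adam optimizer, and a sparse categorical cross-entropy loss. The helper `count_parameters` is included only to show how small the network is.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def build_classifier(input_size=42, num_classes=10):
    """Small feed-forward classifier: 20 and 10 hidden units, softmax output."""
    # Imported here so count_parameters can run without TensorFlow installed.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_size,)),
        tf.keras.layers.Dense(20, activation="relu"),   # assumed activation
        tf.keras.layers.Dense(10, activation="relu"),   # assumed activation
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


print(count_parameters([42, 20, 10, 10]))  # 1180
```

Under these assumptions the network has only 1180 trainable parameters, which is consistent with training finishing quickly on a dataset of a few thousand samples.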

EXPERIMENTAL RESULTS
In machine learning, the effectiveness of a classification model is often assessed using a confusion matrix. This matrix can be generated in Python, for example with scikit-learn. To evaluate the model's performance in predicting hand gestures, we collected experimental data and generated a confusion matrix to determine its accuracy.
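For readers unfamiliar with the technique, the following plain-Python sketch shows how a confusion matrix and the overall accuracy are computed from true and predicted class indices. The labels in the example are made up for illustration and are not the paper's results.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Made-up example with three classes; not data from the paper.
example_true = [0, 0, 1, 1, 2, 2]
example_predicted = [0, 1, 1, 1, 2, 0]
example_matrix = confusion_matrix(example_true, example_predicted, 3)
print(example_matrix)            # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(accuracy(example_matrix))
```

Off-diagonal entries show which gestures are confused with which, information that a single accuracy figure hides.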
The implementation of machine learning for hand gesture recognition using MediaPipe yielded good results, as shown in Figure 4 and Figure 5. Figure 4 depicts the predictions for the ten hand gestures and reveals that some gestures were incorrectly identified. Figure 5 shows a validation accuracy of 95% for recognizing hand gestures.

CONCLUSION
In this hand gesture recognition project, we used Python and OpenCV to create a hand gesture recognizer. For hand detection and gesture recognition, we used the MediaPipe and TensorFlow frameworks, respectively. Along the way, we learnt about file management, common image processing methods, the fundamentals of neural networks, and related topics.
The project is a straightforward illustration of how neural networks can be used to tackle computer vision problems accurately. The resulting recognizer identifies the ten hand gestures with a validation accuracy of about 95%. By creating the necessary dataset and retraining the classifier, the project can be expanded to include other sign languages.