Sign Language Recognition: A High-Performance Deep Learning Approach Applied to Multiple Sign Languages

In this paper we present a high-performance Deep Learning architecture based on Convolutional Neural Networks (CNN). The proposed architecture is effective, as it is capable of recognizing and analyzing different sign language datasets with high accuracy. Sign language recognition is one of the most important tasks that can change the lives of deaf people by facilitating their daily life and their integration into society. Our approach was trained and tested on an American Sign Language (ASL) dataset, an Irish Sign Language (ISL) alphabet dataset and an Arabic Sign Language Alphabet (ArASL) dataset, and it outperforms state-of-the-art methods by providing a recognition rate of 99% for ASL and ISL, and 98% for ArASL.


Introduction
Object recognition is one of the active areas of artificial intelligence that has been studied in recent years. Various machine learning and deep learning methods and techniques have been suggested and developed to solve object classification and recognition problems. Nevertheless, sign language recognition (SLR) is still a difficult task due to the diversity of languages and datasets. It is a very important area that can put an end to many problems that people with disabilities suffer from. Deaf people are an active part of society, so the main objective of our work is to help the rehabilitation of these people by applying artificial intelligence, and especially deep learning techniques, to recognize sign language from images of static hand poses [1]. Hand gestures can be divided into static poses depicted in images [2], as shown in Figure 1, and dynamic poses depicted in videos [3, 4].
Among the methods used in SLR, we mention traditional methods based on machine learning, such as Support Vector Machines (SVM) [5] and Hidden Markov Models (HMM) [6] (Figure 2). In contrast, approaches based on Deep Learning [7] do not require a separate feature extraction and preprocessing step. Due to their power and efficiency, especially in object recognition and image processing, deep learning techniques have attracted the attention of researchers and industry. One of the most powerful deep learning tools is the convolutional neural network (CNN) [8], which outperforms many other methods in object recognition tasks given its ability to work well with very large datasets [9].

Figure 1: Sign language alphabets recognition (static hand poses) vs. dynamic hand gesture recognition.
In order to provide a robust model, we chose to work with large-scale sign language alphabet datasets with enough classes and training examples. In practice, deaf people communicate using hand poses; for this reason we need images containing frontal views of hand poses to achieve a high recognition rate. To make our work efficient, we worked with an American Sign Language (ASL) dataset which contains 29 classes and 87,000 images of 200*200 pixels, shown in Figure 3, an Arabic Sign Language (ArASL) dataset [10] with 54,049 images divided into 32 classes, and an Irish Sign Language (ISL) dataset with 58,114 images [11].
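As an illustration of how such a directory-per-class alphabet dataset can be loaded, a minimal Keras sketch is given below. The directory path and the loading code are assumptions for illustration only, not part of this work.

# Hedged sketch: loading a directory-per-class sign language dataset with Keras.
# The path "asl_alphabet_train/" and the 200x200 image size follow the ASL
# dataset description above; the loading code itself is an assumption, not the
# authors' implementation.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "asl_alphabet_train/",    # assumed layout: one sub-folder per class (29 classes)
    image_size=(200, 200),    # original ASL image size reported above
    batch_size=32,
    label_mode="int",
)
print(train_ds.class_names)   # should list the 29 ASL classes (A-Z, SPACE, DELETE, NOTHING)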
Based on convolutional neural networks, we offer a high-performance architecture inspired by the VGG model. Our architecture achieves a recognition rate of 99% on the ASL and Irish datasets, and 98% on the ArASL2018 dataset.

Related work
In this section, we explore some state-of-the-art works in the field of Sign Language Recognition. We also present some existing open-source datasets that are available to researchers. [12] proposed a SIFT-based geometric computational method to recognize Bangla sign language. To process and normalize the hand gesture images, they applied Gaussian distribution and grayscaling techniques. For the feature extraction step they implemented the Scale-Invariant Feature Transform, after which they applied an SVM classifier and obtained a recognition rate of 88.33%. Aliaa A. et al. [13] developed an automatic Arabic sign language (ArASL) recognition system based on Hidden Markov Models (HMMs); they used a large dataset to recognize 20 isolated words from Arabic Sign Language. The approach was evaluated on real ArASL videos and reaches an accuracy of 82.22%. Another system combined EOH-based features with a multiclass Support Vector Machine (SVM) classifier to recognize static hand gestures; it takes the American Sign Language dataset as input and reaches a recognition rate of 93.75%. Among the works based on deep learning approaches, we find the system developed by Lean Karlo S. et al. [14], which uses a Convolutional Neural Network (CNN) and reaches a recognition rate of 90.04% on the American Sign Language alphabet. P.V.V. Kishore et al. [15] proposed the recognition of Indian Sign Language gestures using CNNs. They employed selfie-mode sign language videos to perform the recognition process. A dataset with 200 signs captured from five different viewing angles was used to train the CNN. They tested various architectures and obtained a best accuracy of 92.88%. To obtain effective Sign Language Recognition and classification results, we need to train our model architecture on a large-scale and diverse dataset that covers all hand-shape poses.
Among the datasets released in the literature, we mention the American Sign Language dataset (ASL), the Arabic Alphabets Sign Language dataset (ArASL) and the Irish Sign Language dataset (ISL).

Proposed approach
The proposed system consists of the recognition of sign language alphabets using deep learning, and especially Convolutional Neural Networks. Convolutional neural networks are very powerful on two-dimensional images and have revolutionized the field of object recognition thanks to their capacity to operate on structured, grid-defined data [16]. An overview of our system is shown in Figure 4.

Data
To ensure that our proposed architecture is powerful and effective, we trained it on several datasets: the American Sign Language dataset (ASL), the Arabic Alphabets Sign Language dataset (ArASL) and the Irish Sign Language dataset (ISL). The ASL dataset contains images of sign alphabets divided into 29 classes. The images are 200*200 pixels and represent the alphabets from A to Z in addition to SPACE, DELETE and NOTHING; these three classes are very important in real-time classification. The ArASL dataset contains 54,049 images of Arabic alphabet signs separated into 32 classes and collected from 40 participants, and the ISL dataset contains 58,114 images of the 23 ISL handshape alphabets. To visualize the given datasets we applied Principal Component Analysis (PCA) [17], which is an efficient method for data dimensionality reduction [18]. Figure 5 shows the visualization of the datasets.
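To make the visualization step concrete, a minimal sketch of a PCA projection is shown below. It is an illustration only, not the published code of this work; the array names (images, labels) are assumptions.

# Minimal sketch of the PCA visualisation step (an illustration, not the
# authors' published code): flatten the hand-pose images, project them onto
# the first two principal components and scatter-plot them coloured by class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_pca(images, labels):
    # images: (n_samples, height, width, channels) uint8 array; labels: (n_samples,) ints
    X = images.reshape(len(images), -1).astype(np.float32) / 255.0
    coords = PCA(n_components=2).fit_transform(X)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=3, cmap="tab20")
    plt.xlabel("principal component 1")
    plt.ylabel("principal component 2")
    plt.title("PCA projection of the sign language images")
    plt.show()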
The depth of the output layer is defined by the number of filters used. The convolution operation maps the feature maps of layer n to layer n+1 by sliding each filter over the input and computing a weighted sum at every position. The output of the convolution layer is the input of the pooling layer. The pooling operation is different from the convolution: it selects a small region of size P_n × P_n and returns the maximum value of the selected region. Another parameter involved in the pooling operation is the stride; if the stride S_n > 1, the spatial dimensions of the resulting layer are reduced accordingly. The final layer in the Convolutional Neural Network is the fully connected layer. This layer functions exactly like a traditional feed-forward neural network, and the data is forwarded from the input to the output. The main reason for employing the convolution layers is to increase the effectiveness of the network.
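The layer-size arithmetic can be illustrated with a short sketch. The formula below is the standard output-size rule for a sliding window without padding; the concrete numbers (50x50 input, 2x2 pooling with stride 2, 3x3 convolution) are examples, not the exact values of Table 1.

# Illustration of the layer-size arithmetic: for an input of width W, a window
# (convolution kernel or pooling region) of size P and stride S, the output
# width is floor((W - P) / S) + 1, assuming no padding.
def output_size(W, P, S):
    """Spatial size after sliding a P-wide window with stride S over W inputs."""
    return (W - P) // S + 1

print(output_size(50, 2, 2))   # 25: a 50x50 map halved by 2x2, stride-2 max pooling
print(output_size(50, 3, 1))   # 48: a 50x50 map after a 3x3 convolution with stride 1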

Proposed Architecture
To construct our model architecture, we based it on VGGNet. The design of the VGGNet architectures shows that the depth of the network is a crucial factor in obtaining better recognition and classification rates. Our network contains four (04) convolutional layers equipped with the ReLU (Rectified Linear Unit) activation function, two (02) pooling layers and three (03) fully connected layers. Before feeding the images to our CNN, we need to perform a normalisation step. In every dataset we resized the images; for example, in the ASL dataset we changed the size from 200*200 pixels to 50*50 pixels, not because of our network's input requirements, but to reduce the dimension of the input images and to speed up the learning operation. Figure 6 presents the proposed CNN architecture.
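A minimal Keras sketch of an architecture matching this description is given below. The filter counts, kernel sizes and dense-layer widths are assumptions chosen for illustration; the exact values used in this work are those of Table 1.

# Hedged Keras sketch: four convolutional layers with ReLU, two max-pooling
# layers and three fully connected layers on 50x50 inputs, as described above.
# Filter counts, kernel sizes and dense widths are illustrative assumptions.
from tensorflow.keras import layers, models

def build_model(num_classes=29, input_shape=(50, 50, 3)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # 29 for ASL, 32 for ArASL, 23 for ISL
    ])

model = build_model()
model.summary()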

Experiments, comparison and discussion
In this section we discuss the performance and results achieved by our model architecture using the ASL, ArASL and ISL datasets. We also compare our work and results with other state-of-the-art methods in the field of Sign Language Recognition based on deep learning, and especially on Convolutional Neural Networks (CNN).

Experiments and results
The algorithms were implemented in Python (we used several libraries that contain visualisation and image processing functions, such as TensorFlow, Keras and Sklearn). The model runs on a Microsoft Azure virtual machine with a six (06) core processor and 56 GB of RAM. Working with convolutional neural networks over a large number of epochs, as well as visualising a large amount of data, can be a time-consuming process; for this reason we used an NVIDIA Tesla K80 GPU. The GPU boost gives superior performance to our model and accelerates the libraries and the training process.
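As a small sanity check of such a setup (an assumption about the workflow, not reported in this work), TensorFlow can be asked to list the visible GPUs before training is launched.

# Confirm that TensorFlow sees the GPU (e.g. the Tesla K80) before training.
import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))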

Experiments on ASL and ISL datasets
Before feeding the two datasets ASL and ISL into the CNN, we first split them into training data and test data. The ASL dataset is divided into 29 classes, where 3,000 images per class are dedicated to training and 200 images to testing. The ISL dataset contains 58,120 images, of which 200 images per class are dedicated to testing. Batch normalization and dropout are used to maintain the effectiveness of our model and to avoid the well-known overfitting problem. A batch size of 32 was selected and the number of epochs was 40. To evaluate the performance of our model, we used the accuracy metric, which consists of dividing the number of correctly predicted samples by the total number of predictions. Figure 7 shows the accuracy obtained using our CNN architecture on the ASL and ISL datasets; the model achieves a score of 99% on both datasets, which is an excellent recognition rate compared with other state-of-the-art methods. To verify that our model performs well on test images, we used the confusion matrix represented in Figure 8. We observe that all values on the diagonal of the confusion matrix are equal to 200, which is the number of samples of each class in the test set, while all other values are zero, which means that all images of each class are correctly recognized by the model. This demonstrates the effectiveness and the high accuracy obtained by our model.
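A minimal training sketch matching the settings reported above (batch size 32, 40 epochs, accuracy metric) is shown below. The model is assumed to be the CNN sketched in the previous section, x_train/y_train and x_test/y_test are the splits described in the text, and the optimizer choice is an assumption.

# Hedged training sketch; not the authors' exact script.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # integer class labels assumed
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=40,
                    validation_data=(x_test, y_test))

test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")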

Experiments on ArASL dataset
For the ArASL dataset, we used the same method and experimental setup. Our CNN model is trained using 54,050 images of Arabic Sign Language divided into 32 classes (compared with 29 classes for the ASL dataset and 26 for ISL) and reaches a high accuracy of 98%. Figure 9 illustrates the accuracy achieved by our model. It can be seen that the use of our CNN approach on the ASL and ISL datasets gives better results than its use on ArASL. This is explained by the smaller number of sign alphabet classes to be recognized and classified (29 classes in ASL and 26 in ISL, against 32 in ArASL). This is the only reason for the accuracy difference, because the two databases are already cropped and did not require any pre-processing step. As in the previous subsection, we present the confusion matrix of our learning strategy; a near-perfect recognition rate is achieved.
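The confusion matrices reported in the figures can be reproduced with a short evaluation sketch such as the one below, which also derives a per-class recognition rate to show which alphabet signs, if any, are confused. The variable names follow the training sketch above and are assumptions.

# Hedged evaluation sketch: confusion matrix and per-class recognition rate.
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = np.argmax(model.predict(x_test), axis=1)
cm = confusion_matrix(y_test, y_pred)

per_class_rate = cm.diagonal() / cm.sum(axis=1)   # correct predictions per class / samples per class
for cls, rate in enumerate(per_class_rate):
    print(f"class {cls:2d}: recognition rate {rate:.3f}")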

Comparative results
In addition to our proposed approach, Table 2 presents the work done and the results achieved in the state of the art of Sign Language Recognition. We observe that our proposed approach shows high accuracy and outperforms the other methods in the field of Sign Language Recognition. The datasets that we used are large and challenging, as they contain diverse hand-pose positions and styles. Furthermore, our approach is fast and provides a less complex structure than other approaches.

Conclusion
In our work, we proposed a high-performance Deep Learning architecture for Sign Language Recognition. We used Convolutional Neural Networks to recognize static hand-pose images. The recognition rate reaches 99% on the American Sign Language and Irish Sign Language datasets, and 98% on the ArASL dataset. The proposed method provides a less complex architecture and performs very well on very large datasets. Our proposed approach outperforms the other state-of-the-art methods in the field of Sign Language Recognition. As future work, we plan to investigate CNNs for dynamic hand-pose recognition. In addition, we intend to combine several methods and classifiers for dynamic sign recognition.

Figure 2: Deep Learning vs. traditional Machine Learning based methods.

Figure 7: Accuracy of the proposed architecture.


Figure 4: Overview of the proposed system: pre-processing and normalization of the sign language images, splitting the data into training and test sets, designing and building the CNN architecture, training, optimising and saving the model, and finally classifying an input hand-pose image to produce the predicted class.

Table 1: Architecture used for ASL.

Table 2: Comparison with previous works on Sign Language recognition.