Comparing word embedding models for Arabic aspect category detection using a deep learning-based approach

Aspect category detection (ACD) is a task of aspect-based sentiment analysis (ABSA) that aims to identify the categories discussed in a given review or sentence from a predefined list of categories. ABSA tasks have been widely studied in English; however, studies in low-resource languages such as Arabic are still limited. Moreover, most existing Arabic ABSA work relies on rule-based or feature-based machine learning models, which require tedious feature engineering and external resources such as lexicons. Therefore, the aim of this paper is to overcome these shortcomings by handling the ACD task with a deep learning method based on a bidirectional gated recurrent unit model. Additionally, we examine the impact of different vector representation models on the performance of the proposed model. The experimental results show that our model significantly outperforms the baseline and related work models, improving the F1-score by more than 7%.


Introduction
Sentiment analysis (SA) is one of the main tasks of natural language processing (NLP) that aims to extract opinions, thoughts, or attitudes toward a specific entity or subject. With the internet revolution, online users create large amounts of data every day. Analyzing these unstructured datasets is valuable for many entities (e.g., governments, companies, and customers) and applications (e.g., recommender systems). Therefore, SA has become a very active field for researchers in both industry and academia.
SA can be handled on three levels, namely document level, sentence level, and aspect level. The first two levels identify the sentiment polarity of the whole document or sentence, which is not always beneficial. Users can express different opinions about different aspects or entities in the same text or review, making the aspect level, also known as aspect-based sentiment analysis (ABSA), more suitable and practical for real-life scenarios.
In the SemEval-16 task 5 [1], ABSA was divided into three main tasks: aspect category detection (ACD), opinion-target expression (OTE) extraction, and sentiment polarity identification. The objective of the first task is to identify aspect categories (E#A pairs) in a given review or sentence from predefined inventories of entities and attributes. The second task aims to extract the opinion-target expression used to refer to each E#A pair. The last task aims to identify the sentiment polarity of each aspect. Fig. 1 shows an annotated review from the SemEval 2016 task-5 dataset. ABSA has been widely studied in English, while research in other languages such as Arabic is still limited and requires further effort [2]. Arabic is one of the four most used languages on the internet. It is the official language of 22 countries and is spoken by more than 400 million people worldwide [3]. Unlike English, Arabic has a rich and complex morphology and many varieties, making SA and ABSA more difficult in this language. Most of the proposed methods in Arabic SA adopt rule-based or traditional machine learning approaches. These methods require laborious feature engineering and external resources (e.g., lexicons and NLP tools) that are unavailable or unreliable, especially for dialectal Arabic [4]. Recently, deep learning (DL) methods have demonstrated their effectiveness in many English NLP tasks, including SA, without the need for manually engineered features or linguistic resources [5]. However, these techniques remain underexplored in Arabic, and only a limited number of DL methods have been published, especially for Arabic ABSA (AABSA). Therefore, the aim of this paper is to investigate the use of a bidirectional gated recurrent unit (BiGRU) model for the ACD task, which is, to our knowledge, the first application of this model to this task in Arabic.
Additionally, the impact of using different word embedding models is examined.
The key contributions of this paper are:
• Applying a BiGRU model to enhance the Arabic ACD task.
• Training a word2vec model for the Arabic hotel domain and examining its effect on the performance of the proposed model. To our knowledge, this is the first application of a domain-specific embedding model in Arabic ABSA.
• Comparing traditional word embeddings with contextual embeddings on the Arabic ACD task.
• Outperforming the baseline and related work models evaluated on the same reference dataset.
The remainder of this paper is organized as follows: work related to the Arabic ACD task is discussed in Section 2. Section 3 presents the research methodology. The dataset and baselines are described in Section 4. Section 5 discusses the results. Conclusions and future work directions are provided in Section 6.

Literature review
AABSA receives less attention from the research community than its English counterpart, and within AABSA the ACD task is less investigated than the other ABSA tasks, namely aspect term extraction and aspect sentiment polarity classification; for example, we could not find any Arabic ACD work published in 2019 or 2020. The first work in AABSA was that of Al-Smadi, Qawasmeh [6]. They provided the first annotated AABSA corpus, following the annotation guidelines of SemEval 2014 task 4 [7], and reported a set of baseline results for each ABSA task, including ACD, for which the baseline achieved an F1-score of 15.2%.
Obaidat, Mohawesh [8] tried to improve on these results using a lexicon-based approach. Fourteen lexicons were built, one per category, using seed words and pointwise mutual information (PMI). The proposed method looked up the words of a given review in each lexicon, summed their weights, and assigned the review to the aspect category with the highest sum. However, the improvement over the baseline was modest (23.4% vs. 15.18%).
In the SemEval 2016 task 5 workshop, a new ABSA corpus was provided for the Arabic language [1,9]. The dataset was selected from a hotel reviews corpus [10] and was annotated at the sentence level (SB1) and the text level (SB2) for three different slots: ACD (slot 1), opinion-target expression (slot 2), and sentiment polarity (slot 3). The baseline models for ACD achieved an F1-score of 40.3% in SB1 and 42.7% in SB2.
In Ruder, Ghaffari [11], the authors introduced the models they submitted to this competition for multiple languages, including Arabic. For slot 1, they implemented a multi-label model based on a convolutional neural network (CNN) that output probabilities over category labels, applying a threshold to make predictions. They initialized word embeddings with 300-dimensional GloVe vectors for English, whereas the vector representations were randomly initialized for the other languages. The proposed model obtained an F1-score of 52.1% for Arabic.
Tamchyna and Veselovská [12] presented a model that was also submitted to the same competition for slot 1 in many languages, including Arabic. They implemented a binary classifier for each category (E#A pair), using a long short-term memory (LSTM) network as an encoder layer to capture long-term dependencies in the data and reduce the need for feature engineering and linguistic tools, and a logistic regression classifier to output label probabilities. The model achieved an F1-score of 47.3% for the ACD task.
The authors in Al-Smadi, Qawasmeh [13] tried to improve on these results by implementing two different models, one based on a support vector machine (SVM) and the other on a recurrent neural network (RNN). Both models were trained using lexical, morphological, syntactic, and semantic features. The RNN achieved results comparable to the best model in SemEval 2016, with an F-measure of 48%, while the SVM obtained the best results, outperforming the RNN by more than 45.4% in F-measure. However, the RNN was faster to execute than the SVM.
Another attempt to improve on these results was proposed in Al-Dabet, Tedmori [14]. They handled the ACD task as a multi-label classification problem using a binary relevance (BR) classification approach. The proposed model combined a CNN with IndyLSTM [15] and was decomposed into a set of independent binary classifiers, one trained per aspect category. The experimental results achieved an F1-score of 58.1%.
Unlike the previous methods, our model is implemented based on a BiGRU model. Additionally, different vector representations are investigated, including word-level, character-level, domain-specific, and contextualized word embeddings models.

Background
This section presents the technical aspects of the main components used to build our model.

Word embeddings
Word embedding models represent text numerically by mapping each word in the corpus vocabulary to a real-valued vector in a vector space of dimension N. Semantically related words tend to appear close to one another in this space and have similar vector representations. One of the earliest word embedding models was proposed by [16], who implemented two architectures, CBOW and skip-gram. The CBOW model predicts the current target word from its surrounding words within a fixed window, whereas the skip-gram model predicts the context words within a fixed window from the current target word. However, a main drawback of these models is that they cannot resolve polysemy: each word receives a single vector regardless of its context.
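To make the difference between the two architectures concrete, the following sketch (our own illustration; the function names are not from any library) generates the training pairs each one learns from: CBOW predicts the target word from its context window, while skip-gram predicts each context word from the target.

```python
def cbow_pairs(tokens, window=2):
    """(context words, target) pairs: CBOW predicts the target from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(target, context word) pairs: skip-gram predicts each context word from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, ctx))
    return pairs

tokens = ["the", "hotel", "room", "was", "clean"]
```

In a full word2vec implementation, each pair becomes one training example for a shallow neural network whose learned input weights are the word vectors.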
Recently, contextualized word embeddings, or language models, have overcome this limitation by providing different representations for the same word depending on the context in which it occurs. One of the first contextualized models was ELMo [17], which assigns an embedding to each word using a bidirectional LSTM (BiLSTM) that reads the entire sentence. Another well-known model is BERT [18], a deeply bidirectional, unsupervised language representation model built on the Transformer architecture. It learns information from both the left and the right context of a token during training. BERT has achieved state-of-the-art results in many NLP tasks, including SA, question answering, and named entity recognition.

GRU network
The gated recurrent unit (GRU) is a form of recurrent neural network (RNN), a class of artificial neural network (ANN) designed to model sequential data and predict likely continuations from learned patterns. The main limitation of plain RNNs is the vanishing gradient problem; LSTM and GRU were introduced to overcome this shortcoming. Compared to LSTM, the GRU has a simpler structure: it combines the input and forget gates of the LSTM into a single update gate and merges the hidden and cell states into a single hidden state. Fig. 2 illustrates the architecture of the GRU. A unidirectional GRU can only process a sequence from front to back, which leads to information loss. Therefore, bidirectional GRUs are widely used in many NLP tasks, including SA, text classification, named entity recognition, and question answering, e.g., [19][20][21][22], to process data in both directions and provide complete context information.
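As a minimal sketch of the gating mechanism (our own NumPy illustration, not the paper's implementation; biases are omitted and the weight matrices are random), a single GRU time step computes an update gate z, a reset gate r, and a candidate hidden state:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step for input x and previous hidden state h_prev.

    z controls how much of the new candidate replaces the old state;
    r controls how much of the old state feeds the candidate."""
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand        # new hidden state

# Toy dimensions: input size 3, hidden size 2 (even indices: W, odd indices: U).
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((2, 3)) if i % 2 == 0 else rng.standard_normal((2, 2))
      for i in range(6)]
h = gru_step(rng.standard_normal(3), np.zeros(2), *Ws)
```

A bidirectional GRU simply runs one such recurrence left-to-right and a second one right-to-left, concatenating the two hidden states at each time step.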

Model overview
The dataset we used in our experiments assigns more than one category to the same review. Thus, we treat the ACD task as a multi-label classification problem. The proposed model is composed of multiple components: an embedding layer to generate the vector representation of each word; a BiGRU layer to capture long-term dependencies and extract text information from the left and right contexts at the same time; average and max pooling layers to reduce dimensionality; a dropout layer to reduce overfitting and prevent co-adaptation of hidden units; and a dense layer with a sigmoid activation function to output a probability for each category. Fig. 3 illustrates the model. The following word embedding models are examined:
• Word-level representations: generated with AraVec [23], an Arabic version of word2vec. The AraVec models were trained on three different dataset resources, namely Twitter, Wikipedia, and Common Crawl web data, for both the skip-gram and CBOW architectures.
• Character-level representations: generated with FastText [24], an extension of word2vec that computes embeddings for character n-grams; a word is then represented as the sum of its character n-gram embeddings. FastText released pre-trained versions for 104 languages, including Arabic; the Arabic version was trained on the Wikipedia dataset with a dimension size of 300.
• Domain-specific word embeddings: pre-trained word embeddings are generally trained on general-domain corpora and therefore cannot capture meanings specific to a particular domain. Many researchers have addressed this problem by training domain-specific embeddings, which has shown promising results in many NLP tasks, including SA, e.g., [25][26][27]. Therefore, we evaluate the impact of domain-specific word embeddings on the overall performance of the proposed model.
• Contextualized word embeddings: these vector representations are generated using language models (LMs). For Arabic, two types of LM exist: monolingual models, trained only on Arabic datasets, such as Hulmona [28], AraBERT [29], and AraGPT2 [30]; and multilingual models, trained on multilingual corpora, such as mBERT, which provides representations for 104 languages, including Arabic. We adopt AraBERT in this research to provide the contextualized embeddings, concatenating the representations from its last four hidden layers.
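To clarify the head of the model described above, the following sketch (our own NumPy illustration with made-up toy dimensions, not the actual deep learning implementation) shows how the BiGRU's per-token hidden states are condensed by average and max pooling and then mapped to independent per-category probabilities by a dense sigmoid layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_head(H, W, b, threshold=0.5):
    """H: (seq_len, hidden) BiGRU outputs; W: (2*hidden, n_categories) dense weights.

    Average and max pooling over the time axis are concatenated, then each
    category is scored independently with a sigmoid (multi-label setting)."""
    pooled = np.concatenate([H.mean(axis=0), H.max(axis=0)])  # (2*hidden,)
    probs = sigmoid(pooled @ W + b)                           # (n_categories,)
    return probs, probs >= threshold                          # categories above threshold

rng = np.random.default_rng(1)
H = rng.standard_normal((12, 8))                   # 12 tokens, BiGRU output size 8
W, b = rng.standard_normal((16, 5)), np.zeros(5)   # 5 example categories
probs, labels = multilabel_head(H, W, b)
```

Because each output passes through its own sigmoid rather than a shared softmax, any number of categories can exceed the threshold at once, matching the multi-label nature of the ACD task.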

Dataset
The dataset used in our experiments was released by the SemEval 2016 task 5 workshop. It was selected from the Arabic hotel reviews provided by ElSahar and El-Beltagy [10]. The annotation was performed at both the text level (2291 reviews) and the sentence level (6029 sentences); this research performs the ACD task on the sentence level only. The total number of predefined categories is 35, each combining an entity (E) and an attribute (A), and each sentence can be assigned more than one category. The distribution of the dataset over aspect categories is illustrated in Table 1 [9]. The dataset was split by the ABSA SemEval-16 organizers [1] into training (4802 sentences) and test (1227 sentences) sets; additionally, we used 10% of the training data for validation.

Baselines
The proposed model is compared against the following baseline models:
• SVM-based: the baseline model released with the SemEval hotel dataset, an SVM classifier with n-gram features.
• Recurrent-based: we compared BiGRU against other recurrent neural networks by replacing the BiGRU layer in the proposed model with BiLSTM, LSTM, and GRU layers, respectively, and evaluating each with all the examined word embedding models.

Experimental environment and parameters
In this research, the word-level embeddings were generated using an AraVec skip-gram model with 300 dimensions trained on a Twitter dataset. For the domain-specific embeddings, we trained a 300-dimensional word2vec skip-gram model on a dataset selected from the corpus provided by Elnagar, Khalifa [31], which contains ≈100k Arabic hotel reviews. Additionally, the AraBERTv01 model [29] was used to generate the contextualized embeddings. This model follows the BERT-base configuration with 12 transformer blocks, 768 hidden dimensions, 12 attention heads, a maximum sequence length of 512 tokens, and a total of ∼110M parameters. It was trained on two corpora [32,33] and on articles scraped from Arabic news websites, and it has a vocabulary of 64k tokens. Table 2 lists the model parameters employed in our experiments, and the experimental environment is presented in Table 3.

Preprocessing
The preprocessing phase consists of removing elongation, punctuation, diacritics, numbers, and non-Arabic words.
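A minimal sketch of this cleaning step (our own illustration; the exact rules used in the paper may differ) relying on the Unicode ranges for Arabic letters (U+0621 to U+064A), diacritics (U+064B to U+0652), and the tatweel elongation character (U+0640):

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")    # tanween, fatha, damma, kasra, shadda, sukun
TATWEEL = re.compile(r"\u0640+")               # kashida / elongation character
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]") # anything but Arabic letters and whitespace

def preprocess(text):
    """Remove diacritics, elongation, punctuation, numbers, and non-Arabic characters."""
    text = DIACRITICS.sub("", text)
    text = TATWEEL.sub("", text)
    text = NON_ARABIC.sub(" ", text)           # drops Latin letters, digits, punctuation
    return " ".join(text.split())              # collapse the resulting whitespace
```

For example, a review fragment containing diacritized Arabic mixed with digits and Latin text is reduced to its bare Arabic words.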
For models with BERT word embeddings, the data was transformed into a BERT-compatible format as follows:
• The data was tokenized to break sentences down into tokens. No segmentation was required because we used AraBERTv01 in our experiments.
• Special tokens ([CLS] and [SEP]) were added at the beginning and the end of each sentence, respectively.
Other steps, including stemming, stop-word removal, and normalization, were omitted because they decreased the performance of the models, especially those using BERT word embeddings. This suggests that applying these operations leads to information loss and harms the contextual information learned by BERT.
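The transformation into a BERT-compatible input can be sketched as follows (illustrative only; a real BERT tokenizer additionally produces subword pieces, token IDs, and segment IDs):

```python
def to_bert_input(sentence, max_len=16):
    """Wrap a whitespace-tokenized sentence with BERT's special tokens and pad it.

    [CLS] marks the sequence start, [SEP] the end; the attention mask flags
    real tokens (1) versus [PAD] positions (0)."""
    tokens = ["[CLS]"] + sentence.split()[: max_len - 2] + ["[SEP]"]
    attention_mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
    tokens += ["[PAD]"] * (max_len - len(tokens))
    return tokens, attention_mask
```

The pooled [CLS] position, or a concatenation of hidden layers as used in this work, then serves as the sentence-level representation.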

Evaluation metric
The standard metric for evaluating the ACD task is the F1-score, computed as the harmonic mean of precision (P) and recall (R):

F1 = 2 × P × R / (P + R)

where P = TP / (TP + FP) and R = TP / (TP + FN), with TP, FP, and FN denoting true positives, false positives, and false negatives, respectively.
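For a multi-label task such as ACD, precision and recall are typically micro-averaged: true positives, false positives, and false negatives are summed over all (sentence, category) decisions before computing F1. A minimal sketch (our own illustration; the official SemEval scorer may differ in details):

```python
def micro_f1(gold, pred):
    """gold, pred: lists of label sets, one set of aspect categories per sentence."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted categories
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious predictions
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed gold categories
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example category labels follow the dataset's E#A naming convention.
gold = [{"HOTEL#GENERAL"}, {"ROOMS#CLEANLINESS", "SERVICE#GENERAL"}]
pred = [{"HOTEL#GENERAL"}, {"ROOMS#CLEANLINESS"}]
```

Here the model finds two of the three gold categories with no false positives, giving P = 1.0, R = 2/3, and F1 = 0.8.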

Results and discussion
The experimental results are presented in Table 4. They show that the proposed model outperforms the baseline by more than 25% in F1-score and achieves better results than related work models evaluated on the same dataset. The BiGRU-based model outperforms more complicated models (e.g., C-IndyLSTM) by more than 1% in F1-score using the same AraVec embeddings and by more than 7% using contextualized embeddings. Moreover, GRU achieves better results than LSTM and BiLSTM. This can be explained by the fact that GRU has fewer parameters than LSTM, which makes learning easier. Furthermore, GRU performs better than LSTM on small datasets [34], which is the case for most Arabic corpora, especially in ABSA.
Meanwhile, the BiGRU model outperforms GRU and BiLSTM outperforms LSTM, which shows that bidirectional models are more effective for this task. Among the embedding models, AraVec embeddings obtain better results than FastText embeddings with all recurrent models. Moreover, domain-specific embeddings prove better than generic embeddings, even though they were trained on a small corpus compared with those used to train the standard word embedding models. This shows the importance of integrating domain-related knowledge when learning word embeddings.
Furthermore, contextualized embeddings achieve the best results in this task (F1-score: 65.5% using BERT+BiGRU). This can be attributed to the ability of the BERT model to resolve polysemy and out-of-vocabulary (OOV) problems, in contrast to context-free word embeddings. This suggests that pre-trained language models handle the morphological complexity and ambiguity of the Arabic language better than context-independent word embeddings in this task.

Table 4. Results of the aspect category detection task.

Conclusion and directions for future work
This work proposed a BiGRU-based model to perform the ACD task on an Arabic reference dataset. Moreover, we examined the impact of different word embeddings, including word-level, character-level, domain-specific, and contextualized word embeddings. Experimental results show that BERT-BiGRU achieves the best F1-score (65.5%) among the evaluated models and significantly outperforms the baseline and related work on the same dataset. Domain-specific embeddings also showed promising results, even though they were trained on a limited-size dataset. Future work therefore includes investigating the effect of training word embeddings on a larger domain-specific corpus. We also plan to further pre-train BERT on in-domain datasets to improve the achieved results, and to investigate data augmentation methods to handle the label imbalance in the dataset and examine their effect on the proposed model's performance.