Arabic Topic Identification: A Decade Scoping Review

Arabic topic identification is a subtask of text classification that aims to assign a given text a set of pre-defined classes (i.e., topics) based on its content and extracted features. This task can be performed using rule-based methods or data-driven approaches. The latter have gained more popularity since they require much less human effort to accurately classify large numbers of documents. Due to the tremendous growth of Web content, primarily on news websites and social media, topic identification has received a great deal of attention in recent years and has become a cornerstone of both search engines and information retrieval. Arabic is the fourth most used language on the Web and recorded the highest growth over the last two decades (2000–2020). In light of these facts, it seems fair to look more closely at the advancements in Arabic topic identification over the last decade. To this end, we performed a scoping review, the first of its kind, that addresses recent studies in the field of Arabic topic identification and follows the PRISMA-ScR guidelines. The review is based on various online bibliographic databases (e.g., Springer, ScienceDirect, and IEEE Xplore) and dataset search engines (e.g., Google Dataset Search).


Introduction
The World Wide Web has become the main destination for searching and retrieving information. Broadcasting platforms are rich sources of data regarding diverse topics and form a valuable foundation for research. Further, with the rapid growth of machine-readable documents, the demand for Text Classification (TC) is increasing especially for Web page content filtering, Web information retrieval, and e-mail spam detection.
The amount of Arabic Web content has been increasing dramatically over the last two decades. The use of Arabic on the Web grew by around 9,348.0% between 2000 and 2020, the highest growth among the ten most used languages†. What is more, according to Google Translate‡, Arabic is among the top 5 languages used in document translation. There may be many reasons for this, among them the fact that, as of March 31, 2020, there were more than 447 million Arabic speakers across 29 countries, 53% of whom have access to the Internet. Finally, more than 1.9 billion Muslims worldwide use Arabic to perform their daily prayers, 76.5% of whom are not native Arabic speakers [1].
Since Western scholars were the forerunners in topic identification, most effort and experiments were devoted to English. However, many other languages, such as Arabic, Persian, and Malay, are catching up. Unlike previous decades, the last one has witnessed remarkable progress in building Arabic corpora, which paved the way for data-driven approaches to tasks like topic identification. Among the many relevant works devoted to Arabic corpora in the last decade, two surveys [2,3] are probably the richest and most recent ones that list dozens of freely available Arabic corpora and language resources. A few major examples will be mentioned in Section 4 to give an idea of the variety of what is available. Furthermore, several machine learning algorithms have been successfully applied to automatic Arabic topic identification, namely Naïve Bayes, Support Vector Machine (SVM), decision trees, k-Nearest Neighbours (k-NN), and Artificial Neural Networks (ANN) [4].

* Corresponding author: elkah.anoual.mri@gmail.com
† https://www.internetworldstats.com/stats7.htm
‡ https://blog.google/products/translate/ten-years-of-google-translate/
Following the PRISMA-ScR guidelines [5], this paper provides a scoping review of the major progress that has been made in Arabic topic identification: primarily, the available datasets, the implemented approaches, and the challenges researchers face in carrying out this task. Beyond compiling the corpora or datasets used for training and testing the selected models, most automatic topic identification systems consist of three main phases (i.e., text pre-processing, feature selection, and classification). Therefore, the main findings of our scoping review are listed for each phase separately.
The general aim of this paper is to provide, for new researchers in the field as well as advanced practitioners, a closer look at the advancements in Arabic topic identification over the last decade, and to suggest a number of recommendations, from the authors' point of view, for better Arabic topic identification.
The remainder of this paper is arranged in seven main sections. Section 2 addresses the Arabic language characteristics that have a great impact on the topic identification task. Section 3 presents the methodology adopted to perform our review. The next sections are dedicated to the review's findings. Section 4 covers relevant and available Arabic corpora related to Arabic topic identification. Section 5 introduces the text pre-processing techniques commonly applied to improve classification performance. Section 6 lists popular text representation methods used in the feature selection phase. Section 7 presents the machine learning algorithms applied to Arabic topic identification. Finally, we conclude this paper in Section 8.

Arabic language characteristics
Alongside its worldwide use and significant presence on the Web, the Arabic lexicon and grammar are deeply rooted, having been well established long ago. As a rich Semitic language, its characteristics have intrigued lexicographers and morphologists for centuries. It is therefore no surprise that Arabic is an interesting language and a fruitful area of research that receives attention from many scholars, especially those working in Natural Language Processing (NLP).
Arabic is a challenging language for existing topic identification techniques because of its characteristics. First, the morphology of Arabic differs in the structure of affixes from Indo-European languages [6]. Therefore, using language-independent stemming or lemmatization methods in the pre-processing stage, or adopting the same approaches used for other languages, mostly English, may lead to results different than expected [7,8]. Moreover, almost all contemporary Arabic texts are written without diacritics, even though these vowel marks can encode grammatical categories as well as feature information. Consequently, the same word may be spelled in different ways, which poses difficulties for several automatic processing systems, including Arabic topic identification [9].
The richness of the Arabic language stems from the fact that hundreds of words with different meanings can be generated from one root [10]. For instance, the words "معلِّم" (i.e., teacher), "عِلْم" (i.e., science), "عَلَم" (i.e., flag), and "عَلامة" (i.e., sign) are all derived from the same root "علم". What is more, it is estimated that the average number of possible part-of-speech tags for a word in most languages is 2.3, whereas in Modern Standard Arabic it is 19.2 [11]. For example, the three consonants "عقد" <Eqd> can stand for the verb "عَقَدَ" <Eaqada> ("he holds"), or for the noun "عِقْد" <Eiqod> ("necklace"), among other part-of-speech tags. Given the importance of dealing with synonyms during the topic identification process, Arabic becomes very challenging since a word generally has many synonyms. The most popular example in this context is the word "lion", which in Arabic has between 350 and 500 synonyms§. Finally, an Arabic word can represent a whole sentence through sequential concatenation. For example, the Arabic word "أَنُلْزِمُكُمُوهَا" <AanulozimukumuwhaA> from the 28th verse of chapter 11 (sūrat Hūd) means in English "Should we compel you to accept it".

§ https://ar.wikipedia.org/wiki/قائمة_أسماء_الأسد_في_اللغة_العربية

Methodology
Generally, an automatic topic identification system consists of three main phases, illustrated in Figure 1. The first is the text pre-processing phase, which comprises different techniques that deal with the text structure and the language characteristics; tokenization, stop word removal, stemming, and lemmatization are among these techniques. The next phase is feature selection, which aims to transform the pre-processed text into suitable features [12]. These extracted features then represent the initial text in a format that adequately fits the requirements of the machine learning algorithm selected for the classification phase. The latter is the final phase, where a classification algorithm is trained to identify the probable topic of a given input text. To shed light on relevant works published and freely available Arabic datasets released in the last decade in the field of topic identification, we performed this scoping review using various online bibliographic databases (e.g., Springer, ScienceDirect, IEEE Xplore, and Google Scholar) and popular dataset search engines (e.g., Google Dataset Search and the Linguistic Data Consortium).
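The three phases above can be sketched end to end. The following is a minimal illustrative sketch using scikit-learn, with a tiny invented English corpus standing in for a real (Arabic) dataset; the categories, documents, and choice of Naïve Bayes are assumptions for demonstration only.

```python
# Sketch of the three phases of a topic identification system:
# pre-processing -> feature extraction/selection -> classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = [
    "the team won the football match",       # sports
    "the bank raised interest rates",        # finance
    "the striker scored a late goal",        # sports
    "markets fell after the rate decision",  # finance
]
labels = ["sports", "finance", "sports", "finance"]

pipeline = Pipeline([
    # Phases 1-2: tokenization, stop word removal, TF-IDF features
    ("features", TfidfVectorizer(stop_words="english")),
    # Phase 3: Naive Bayes, one of the algorithms commonly
    # reported for Arabic topic identification
    ("classifier", MultinomialNB()),
])
pipeline.fit(docs, labels)
predicted = pipeline.predict(["the football team scored a goal"])
```

A real system would swap in Arabic-specific pre-processing (stop word lists, stemming) before vectorization, as discussed in Sections 5 and 6.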
Unlike systematic reviews, scoping reviews do not normally evaluate the quality of the works covered; they focus on the subject investigated and may include more heterogeneous studies [5]. Our review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) [5]. To the best of our knowledge, this review is the first of its kind in the field of Arabic topic identification to follow the PRISMA-ScR guidelines. Figure 2 displays the PRISMA-ScR flow diagram adopted in this study. At first, we retrieved 762 records by searching the previously mentioned databases and dataset search engines with keyword-based queries. Here is an example of a search query: ("Arabic" OR "Arab") AND ("Corpora" OR "Corpus" OR "Dataset") AND ("Topic identification" OR "Text classification" OR "Text categorization" OR "Document classification" OR "Document categorization"). After excluding many records according to the predefined criteria (see Fig. 2), we identified a total of 57 records (i.e., publications and datasets). Note that this search was carried out between the 7th and 25th of December 2020.

Corpora and datasets
Corpora have been widely used in, and significantly affect, a wide range of NLP tasks and applications, and the availability of suitable corpora has greatly influenced Arabic topic identification. According to our search in popular dataset search engines, namely the European Language Resources Association (ELRA) catalogue**, the Linguistic Data Consortium (LDC) catalogue††, Google Dataset Search‡‡, and SourceForge§§, we retrieved 182 Arabic datasets that were released or updated during the last decade (2010–2020). Figure 3 exhibits the number of released corpora for each year. Among these corpora, only a few are freely available and devoted or suited to training and evaluating Arabic topic identification:
• SANAD*** corpus [13]: This corpus comprises three different datasets that were annotated using single categories (Culture, Finance, Medical, Politics, Religion, Sports, and Tech). The SANAD corpus is composed of 110,900 articles collected by web scraping, using tools such as Python Selenium, from three news websites (alarabiya.net, alkhaleej.ae, and akhbarona.com);
• NADiA††† corpus [14]: It consists of two multi-label datasets. The first comprises 35,416 news articles collected from SkyNewsArabia.com; these articles were annotated with up to eight labels. The second includes 451,230 news articles collected from Masrawy.com. For both datasets, the labelling was done manually by selecting the most frequently occurring tags in the dataset;
• OSIAN‡‡‡ corpus [15]: It is a web-derived corpus compiled from 31 different international Arabic news broadcasting platforms. Crawled under a server-friendly policy, the corpus consists of about 3.5 million articles comprising more than 37 million sentences and roughly 1 billion tokens.

Text pre-processing techniques
Due to the challenges presented by the Arabic language characteristics, many studies recommend applying text pre-processing techniques to effectively reduce the data dimensionality and enhance topic identification [12]. According to our review, the pre-processing techniques most commonly applied for Arabic topic identification are stop word removal (27 publications), stemming (16 publications), and lemmatization (5 publications). It is worth mentioning that these techniques were used either individually or in combination. Further, some of the stemmers, lemmatizers, and stop word lists were first published before 2010 but were reused after 2010.

Stop words removal
Stop word removal aims to remove the most frequent words that appear in a processed text yet provide no significant indication of the topic of the text.
Removing stop words first requires compiling a list of stop words. The compilation process usually relies on one of two methods. The first is a rule-based method that includes morphological analysis (e.g., [21]); it produces a domain-independent list intended for general use. The second is a statistical method that uses the frequency features of a particular corpus (e.g., [22]); it generates a domain-dependent list for use in a particular domain.
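The statistical method above can be sketched in a few lines: the most frequent tokens in the corpus become the stop word list. The toy corpus and the cut-off of three words are illustrative assumptions, not values taken from the surveyed studies.

```python
# Frequency-based (statistical) compilation of a stop word list:
# the most frequent corpus tokens are treated as stop words.
from collections import Counter

corpus = [
    "the economy grew and the markets rose",
    "the team and the fans celebrated",
    "the storm hit and the city flooded",
]

counts = Counter(token for doc in corpus for token in doc.split())
stop_words = {word for word, _ in counts.most_common(3)}

def remove_stop_words(text):
    """Drop tokens that appear in the compiled stop word list."""
    return [t for t in text.split() if t not in stop_words]
```

Because the list is derived from a specific corpus, it is domain-dependent, exactly as the statistical method described above implies.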

Stemming
Stemming is used to reduce inflected words to their stems or roots, depending on the method applied. The stem-based method commonly removes only clitics from a given word without trying to find its root (e.g., [23]). This method keeps the meaning of the text content well represented, since it groups inflected words that are grammatically related. Nine separate studies [24] reported that using the stem-based method improves Arabic text classification performance, whereas the root-based method showed no improvement in classification [25].
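To make the stem-based (light stemming) idea concrete, the sketch below strips a small illustrative subset of common Arabic clitics without attempting root extraction. The affix lists are assumptions for demonstration; real light stemmers such as those cited above use far richer rules.

```python
# Minimal light-stemming sketch for Arabic: remove common clitic
# prefixes/suffixes, but never reduce a word below three letters,
# and never attempt to find the root.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ب", "ل"]
SUFFIXES = ["ها", "هم", "ات", "ون", "ين", "ة"]

def light_stem(word):
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]  # strip one prefix clitic
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]  # strip one suffix clitic
            break
    return word
```

Note how grammatically related forms (e.g., "المعلمون", "معلم") collapse to the same stem while the stem, unlike a root, still carries the word's surface meaning.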

Lemmatization
Text lemmatization intends to group semantically related words written in different syntactic forms and to associate them with their lemma. Unlike stop word removal and stemming, the use of lemmatization for Arabic topic identification is fairly limited, since most Arabic lemmatizers are proprietary and not publicly available. However, several investigations reported that applying lemmatization enhances Arabic topic identification (e.g., [12] and [29]).
In our review, we found a few freely available Arabic lemmatizers (such as Farasa [30] and Madamira [31]) that have been used in Arabic topic identification.

Features selection methods
Generally, a topic identification system includes pre-processing phases, which can comprise text cleaning, feature extraction, and feature selection, followed by the classification phase. The purpose of the pre-processing tasks is to enhance the efficiency of the selected classifier. Text cleaning usually includes tokenization, lower-case conversion, stop word removal, and word stemming. The feature extraction phase normally relies on the text representation model and the vectorization process. Based on our review, Bag of Words (BOW), Bag of Concepts (BOC), CountVectorizer, TfidfVectorizer, and the Chi-squared test (χ² test) are the common techniques used for this task [32-34].
The BOW model expresses a document as a compact representation of its textual content. This model is commonly formed using words as features: the text in a document is represented by the words it is composed of, and the document is classified into a category based on the proportion of words it has in common with other documents from the same category [35].
The BOC representation model intends to unite words that share the same concept; synonymous words are thus mapped to the same concept, capturing the unique meaning they share. Words with multiple meanings are mapped to different concepts based on the surrounding text. To use the BOC representation model in an Arabic text classification (ATC) system, a knowledge base such as WordNet, the Open Directory Project, or Wikipedia is needed to provide the concepts [36].
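The core BOC mechanism can be sketched with a toy lexicon: synonyms are replaced by a shared concept identifier before counting. The hand-made `CONCEPTS` mapping below is an illustrative stand-in for a real knowledge base such as WordNet; it also ignores context, so it does not handle the multi-sense disambiguation described above.

```python
# Toy Bag-of-Concepts: map synonyms to one concept ID, then count
# concepts instead of surface words.
from collections import Counter

CONCEPTS = {
    "physician": "DOCTOR", "doctor": "DOCTOR", "medic": "DOCTOR",
    "car": "VEHICLE", "automobile": "VEHICLE",
}

def bag_of_concepts(text):
    """Count concept IDs; tokens with no known concept count as themselves."""
    tokens = text.lower().split()
    return Counter(CONCEPTS.get(t, t) for t in tokens)
```

With this mapping, "the physician saw the doctor" yields a single DOCTOR feature with count 2, which is precisely the generalization over synonyms that BOW misses.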
The vectorization process turns the text into numerical vectors, with each sentence represented by one vector. CountVectorizer records raw token frequencies indexed by the vocabulary, whereas TfidfVectorizer additionally weights tokens by their distribution across all documents. The TF-IDF weight is composed of two terms:
• Term Frequency (TF): it measures the frequency of a token in a document. Since documents differ in length, a given token may occur many more times in long documents than in short ones.
• Inverse Document Frequency (IDF): it measures how important a word is by weighing down frequent tokens and scaling up rare ones.
The Chi-squared test is used to verify the relationship between the occurrence of a particular word and the occurrence of a given topic. If no relationship exists between the word and the topic, they are independent, which is the null hypothesis of the Chi-squared test [34].

Text classification algorithms
The availability of datasets, which have grown significantly on the Internet in recent decades, enables the use of data-driven approaches to text categorization. As our emphasis in this work is the Arabic language, we pay particular attention to techniques implemented for Arabic topic identification. Expressly, this section presents the findings of our review regarding the algorithms most successfully used in the field of Arabic topic identification in the last decade.
Generally, three machine learning paradigms can be used to classify documents: unsupervised, supervised, and semi-supervised. Many methods and algorithms have been proposed for Arabic topic identification; however, our focus is on supervised classification techniques. Based on relevant studies [4,37-39], the most successful of these include Naïve Bayes, SVM, decision trees, k-NN, and the Artificial Neural Network (ANN). The ANN contains three layers: the input layer, which receives data; the hidden layer, which performs data processing; and the output layer (Evangelin and Fred, 2017). The ANN is one of the easier models to use, and it is usually appropriate for complex problems or large texts. Its disadvantages include being less recommended when simpler solutions suffice or when texts are small; it also requires loading substantial training data.
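The three-layer ANN described above can be sketched with scikit-learn's multi-layer perceptron. The toy corpus, labels, and hidden-layer size are illustrative assumptions, not settings from the surveyed studies.

```python
# A three-layer feed-forward ANN (input, one hidden layer, output)
# trained on toy TF-IDF features for topic classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

docs = [
    "the team won the cup final",
    "the striker scored two goals",
    "the central bank cut rates",
    "inflation and interest rates rose",
]
labels = ["sports", "sports", "finance", "finance"]

X = TfidfVectorizer().fit_transform(docs)
ann = MLPClassifier(hidden_layer_sizes=(16,),  # a single hidden layer
                    max_iter=2000, random_state=0)
ann.fit(X, labels)
```

The fitted model reports three layers in total, matching the input/hidden/output structure described in the text; for a real Arabic corpus, the input layer's size would equal the vocabulary dimension produced by the vectorizer.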

Conclusion and perspectives
It is hoped that this scoping review of Arabic topic identification will guide current and future research directions. We followed the PRISMA-ScR guidelines to ensure that only reliable resources were covered. This paper mainly focuses on the advancements in Arabic topic identification. Its scope covers the Arabic characteristics that can be challenging in this field; nonetheless, it remains a fruitful area of research. Major progress has been made in building and providing the required datasets, but there is still a long way to go to catch up with Western languages such as English.
The most successful machine learning algorithms used in the last decade for Arabic topic identification have been listed based on several comparative studies and surveys. Nonetheless, more effort is still required to enhance their performance, either by working on the pre-processing phases or by implementing other algorithms.
Through this review, we aim to initiate advancements in Arabic topic identification research.