Research on traditional Mongolian sentiment analysis combined with prior knowledge



Introduction
Sentiment analysis, one of the important research directions in natural language processing, helps users quickly analyze the opinions, emotions, evaluations, and attitudes that a corpus expresses toward entities and their attributes [1]. In recent years, with the emergence and rapid development of social media platforms such as Weibo, forums, and Douyin, people are increasingly inclined to express their personal emotions on these platforms. Faced with vast amounts of data, classifying it by sentiment so as to guide users to valuable information has become one of the current research hotspots.
Compared with mainstream languages such as Chinese and English, research on Mongolian sentiment analysis started relatively late. There are consequently few related studies, and Mongolian sentiment analysis is still at an exploratory stage. Research on Mongolian sentiment classification therefore has important academic value and broad practical significance. The Mongolian language falls mainly into two categories: the traditional (Uyghur-script) Mongolian commonly used by Mongolians in China, and the Cyrillic Mongolian used in Mongolia. Traditional Mongolian [2] consists of 7 vowels and 26 consonants, 33 letters in total. Each letter takes one of three variant forms (initial, medial, or final) depending on its position in a word, and, unlike Chinese and English, traditional Mongolian is spelled with the word as the basic unit. Cyrillic Mongolian is based on the Cyrillic alphabet and is written horizontally from left to right. The differences between the two Mongolian scripts are therefore obvious [3].
With the growing attention to deep learning methods in natural language processing, excellent pre-trained models such as XLNet [4], GPT [5], and BERT [6] have appeared one after another. The open pre-trained models released by related research teams already cover more than 100 languages [7], including not only mainstream languages such as Chinese and English but also lower-resource languages such as Cyrillic Mongolian, Urdu, and Persian. However, owing to the complexity of traditional Mongolian grammar and the relative scarcity of its corpora, there is little research on Mongolian sentiment classification combined with pre-trained models. Therefore, to improve the efficiency of traditional Mongolian sentiment analysis, this paper proposes a traditional Mongolian pre-trained sentiment analysis model that integrates prior knowledge.
The main contributions of this paper are as follows. A Mongolian sentiment dictionary is created, which provides the database from which the model incorporates prior knowledge. A traditional Mongolian pre-trained sentiment analysis model is built on the Transformer structure. Before training the sentiment analysis model, a BPE-Dropout word segmentation technique adapted to Mongolian characteristics is adopted to alleviate the out-of-vocabulary problem, and the Mongolian sentiment dictionary is fused with the attention mechanism to enhance the semantic information carried by sentiment words, thereby improving the efficiency of the sentiment analysis model.

Pre-trained model
Pre-training [8] trains a deep network structure on a large-scale unlabeled text corpus to obtain a set of model parameters; this deep network structure is usually called a "pre-trained model". Its advantages are that the training cost is low and that it converges well on downstream tasks, thereby improving model performance to a large extent. Early pre-training techniques were static. Since ELMo [9] proposed a context-sensitive text representation method in 2018, pre-trained language models such as GPT, BERT, and XLNet have been proposed one after another, and pre-training technology has since been widely used in natural language processing [10]. When the Google team released the BERT model in October 2018, it immediately broke the performance records of 11 NLP tasks, pushing the development of pre-trained language models to a climax. According to their model structures, emerging pre-training techniques [11] can be divided into two categories. The first is the autoregressive language model XLNet, which can obtain true bidirectional context information. The second comprises improved models based on BERT, which can be roughly grouped into improving generation tasks, introducing multi-task learning or knowledge, and adjusting the training method. By using three mechanisms, the permutation language model, two-stream self-attention, and recurrence, XLNet not only solves the problem that predictions between masked words in BERT are not independent, but also enables an autoregressive model to obtain true bidirectional context information. In 2019, to address the excessive parameter counts of pre-trained models, Lan et al. [12] proposed the ALBERT (A Lite BERT) model, which replaces BERT's Next-Sentence Prediction (NSP) task with a Sentence-Order Prediction (SOP) task and, while reducing the number of pre-trained parameters, has achieved good results on multiple natural language understanding tasks. In 2020, by introducing multi-task learning, Sun et al. proposed the ERNIE 2.0 [13] model. Compared with other pre-trained models, it better captures lexical, syntactic, and semantic information, and it is effective on reading comprehension, sentiment analysis, and question answering tasks.
A common disadvantage of the above methods is that they generally lack real-world knowledge. With the development of deep learning, however, pre-trained models that incorporate external knowledge have gained widespread influence and achieved remarkable results in natural language processing.
In 2019, Zhang et al. [14] proposed the ERNIE model, which leverages structured knowledge to learn more prior knowledge and thereby improves the efficiency of knowledge-driven tasks. In 2020, Tian et al. [15] introduced prior knowledge and proposed the SKEP model. SKEP uses automatically mined knowledge to perform sentiment masking, constructs three sentiment knowledge prediction objectives, and embeds sentiment information at the word, polarity, and aspect levels into pre-trained sentiment representations. Experimental results show that SKEP outperforms the RoBERTa model by a large margin and achieves state-of-the-art results on most test datasets.

Mongolian sentiment analysis model
Because out-of-vocabulary words are a serious problem in Mongolian data preprocessing, Suyila et al. [16] adopted BPE word segmentation to process their corpus in 2020. Their experimental results show that BPE can effectively mitigate the drop in model performance caused by large numbers of out-of-vocabulary words. In 2020, Bao Qi Li Muge [17] constructed a Mongolian sentiment lexicon; the algorithm is relatively simple and accurate, but the overall efficiency of a sentiment analysis method based entirely on a sentiment dictionary is low, and such a model is difficult to promote widely. In 2021, for Mongolian news text classification, Qi Ligel [18] proposed a Bi-LSTM model with an attention mechanism and achieved good results. With the further development of the field, sentiment corpora have become easier to obtain, and the efficiency of sentiment analysis for mainstream languages such as Chinese and English has improved greatly. However, the scarcity of traditional Mongolian corpora has severely restricted progress. At present, there are few studies combining Mongolian sentiment analysis with pre-trained deep learning models, and even fewer that build models incorporating prior knowledge. In response to these problems, this paper proposes a traditional Mongolian sentiment analysis model that integrates prior knowledge: by constructing a traditional Mongolian pre-trained model and integrating prior knowledge, the efficiency of Mongolian sentiment analysis can be improved to a large extent.

Constructing Mongolian vocabulary
At present, most pre-trained models are built at the character or word level. Therefore, when constructing a pre-trained model for traditional Mongolian, the most critical step is to create a Mongolian vocabulary. A vocabulary usually contains individual characters, words, or other tokens. The quality of the vocabulary is closely related to the word segmentation technique: a reasonable segmentation technique yields more complete and accurate vocabulary information, and in turn an accurate vocabulary allows sentences to be segmented more naturally. Commonly used Chinese word segmentation tools include jieba, NLPIR developed by the Chinese Academy of Sciences, and THULAC released by Tsinghua University [19]; commonly used English subword segmentation methods include BPE, WordPiece, and Unigram.
Research on traditional Mongolian started late, and there is still no unified Mongolian word segmentation algorithm, so this work remains exploratory. Targeting the structural characteristics of traditional Mongolian, this paper adopts a segmentation method that combines a vocabulary with BPE-Dropout [20]. The corpus consists of 1.3 million unlabeled Mongolian entries collected by the School of Information Engineering, Inner Mongolia University of Technology. First, the corpus is preprocessed and split once using simple whitespace tokenization. Second, word frequency information is counted and words are added to the vocabulary in turn to construct the Mongolian vocabulary. Then, each corpus string is matched against the Mongolian vocabulary, and segmentation is performed wherever a match succeeds. Finally, the BPE-Dropout method is applied for further segmentation. The pseudocode of the BPE-Dropout algorithm is as follows, where merges denotes the candidate combinations of Mongolian roots and affixes that form new Mongolian words, and p is the probability of randomly discarding a root-affix merge:

merges ← the union of the split word units and the Mongolian root-affix list
for merge in merges do
    remove merge with probability p
end for
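To make the procedure concrete, the following is a minimal Python sketch of BPE-Dropout segmentation. It is not the authors' implementation: the merge-table format and the Latin-transliterated example word are hypothetical.

```python
import random

def bpe_dropout_segment(word, merges, p=0.1, seed=None):
    """Segment a word with BPE-Dropout: standard BPE merging, except that
    each applicable merge is randomly skipped with probability p.

    word   -- input token, treated here as a character sequence
    merges -- dict mapping a symbol pair to its merge priority (lower = earlier)
    p      -- dropout probability for each candidate merge
    """
    rng = random.Random(seed)
    symbols = list(word)
    while len(symbols) > 1:
        # Collect merges present in the current sequence, dropping each
        # candidate with probability p (the "dropout" step).
        candidates = []
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            if pair in merges and rng.random() >= p:
                candidates.append((merges[pair], i))
        if not candidates:
            break
        # Apply the highest-priority surviving merge.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table over a Latin-transliterated word (hypothetical values).
merges = {("s", "a"): 0, ("sa", "i"): 1, ("i", "n"): 2}
print(bpe_dropout_segment("sain", merges, p=0.1, seed=42))  # e.g. ['sai', 'n']
```

Because merges are dropped at random, repeated encodings of the same word produce different subword splits, which exposes the model to more root-affix combinations than deterministic BPE.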

Creating Mongolian sentiment dictionary
By embedding the sentiment dictionary into the pre-trained model as prior knowledge, the efficiency of Mongolian sentiment analysis is improved. This paper adopts the seven emotion categories proposed by Xu Linhong et al. [21]: 'happy', 'like', 'surprise', 'sad', 'angry', 'disgust', and 'fear'. The Mongolian sentiment dictionary created here contains 20,000 sentiment words. The construction process is as follows. First, the affective lexicon ontology of Dalian University of Technology is translated, and identical Mongolian sentiment words are merged after translation. Then, words with strong sentiment polarity and high frequency are selected as seed words and grouped by polarity, and the semantic similarity between other unknown words and the seed words is calculated with the PMI method. Next, the sentiment polarity of unknown new words is computed to expand the Mongolian vocabulary. Finally, the translated Mongolian sentiment words and the expanded dictionary are combined into the final sentiment dictionary.

Pointwise mutual information (PMI) is mainly used to calculate the similarity between words. The similarity between words $w_1$ and $w_2$ is

$$PMI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \tag{1}$$

where $P(w_1, w_2)$ is the probability that $w_1$ and $w_2$ co-occur, and $P(w_1)$ and $P(w_2)$ are the probabilities that $w_1$ and $w_2$ appear alone; $w_1$ denotes an unknown word and $w_2$ a seed word. If the result of formula (1) is large, i.e., the similarity is high, the two words share the same sentiment polarity, and vice versa.
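As an illustration of formula (1), here is a minimal sketch of PMI estimation from sentence-level co-occurrence counts. The counting scheme, the base-2 logarithm, and the toy tokens are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import Counter
from itertools import combinations

def pmi(word1, word2, sentences):
    """Estimate PMI(w1, w2) (formula (1)) from sentence co-occurrence."""
    n = len(sentences)
    single, pair = Counter(), Counter()
    for sent in sentences:
        words = set(sent)                       # presence, not frequency
        single.update(words)
        pair.update(frozenset(c) for c in combinations(sorted(words), 2))
    p1, p2 = single[word1] / n, single[word2] / n
    p12 = pair[frozenset((word1, word2))] / n
    if p12 == 0 or p1 == 0 or p2 == 0:
        return float("-inf")                    # words never (co-)occur
    return math.log2(p12 / (p1 * p2))

# Toy corpus with Latin placeholders standing in for Mongolian tokens.
sents = [["good", "happy"], ["good", "quality"], ["happy", "good", "cheap"]]
print(pmi("good", "happy", sents))  # larger value = more likely same polarity
```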
Using the multi-head attention mechanism, this paper fuses the prior knowledge with the model. The specific steps are as follows. First, the Mongolian sentiment dictionary, as prior knowledge, is matched against the input corpus to locate and focus on the sentiment words in it. Then, the multi-head attention mechanism applies several linear transformations to the input vectors and computes the attention weights. Finally, attention is computed after mapping $Q$, $K$, and $V$ through the parameter matrices, and the results of all attention heads are concatenated:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, head_2, \cdots, head_h)\,W^O \tag{2}$$

$$head_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \tag{3}$$

Here $Q$, $K$, and $V$ are the input word-vector matrices for query, key, and value, respectively; $W^O$ is the output weight matrix of multi-head attention; and $W_i^Q$, $W_i^K$, and $W_i^V$ are the weight matrices of the $i$-th attention head.

As shown in Figure 1, take the sentiment analysis of one Mongolian sentence as an example. First, the sentiment sentence "ᠴᠢᠨ᠋ ᠠᠷ ᠰᠠᠢᠨ᠂ �ᠤᠪᠴᠠᠰᠤ �ᠤᠨ᠋ ᠠᠷ ᠨᠢ ᠨᠡᠯᠢᠶᠡᠳ ᠵᠦ᠋�ᠷ᠂ ᠡᠨ᠋ ᠡ ᠤᠳ᠋ᠠ� � ᠶ᠋ᠢᠨ �ᠷᠠ� � �ᠤᠳ᠋ᠠᠯᠳ᠋ᠤᠨ ᠠ�ᠯᠲᠠ ᠶᠡ� �ᠶᠠᠷᠲᠠᠢ" is input into the model; its English translation is "The shopping was very pleasant, since the clothes are of good quality and the material is comfortable". Second, according to the Mongolian sentiment dictionary established in this paper, the sentiment words in the sentence are marked so that the key sentiment words can be extracted. Next, the sentence is encoded, mainly as the combination of three parts: token encoding, position encoding, and segment encoding. Then, the encoded information is fed into the Transformer of the traditional Mongolian BERT model for training, and sentiment features are extracted. The Transformer mainly comprises the self-attention mechanism, layer normalization, and a multi-layer perceptron (MLP). The self-attention mechanism determines weights according to the similarity between vectors in the sequence; to a certain extent, this avoids the information loss that recurrent neural networks suffer when highly correlated information lies far apart. Layer normalization normalizes over the feature dimensions of a single sample, which speeds up convergence to a certain extent, and the MLP increases the representational capacity of the model. Finally, cross entropy is used as the loss function, and a Softmax over the fully connected layer outputs the prediction result.
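As a concrete illustration of formulas (2) and (3), the following NumPy sketch computes multi-head attention with per-head projection matrices. All dimensions and the random weights are illustrative, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Formulas (2)-(3): project Q, K, V per head, apply scaled dot-product
    attention, concatenate the heads, and map through W^O."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        q, k, v = Q @ Wq_i, K @ Wk_i, V @ Wv_i
        scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # attention weights
        heads.append(scores @ v)                          # head_i
    return np.concatenate(heads, axis=-1) @ Wo            # Concat(...) W^O

# Toy dimensions: 5 tokens, model dim 8, 2 heads of dim 4.
rng = np.random.default_rng(0)
L, d, h, dk = 5, 8, 2, 4
X = rng.normal(size=(L, d))                # token representations (Q = K = V)
Wq = [rng.normal(size=(d, dk)) for _ in range(h)]
Wk = [rng.normal(size=(d, dk)) for _ in range(h)]
Wv = [rng.normal(size=(d, dk)) for _ in range(h)]
Wo = rng.normal(size=(h * dk, d))
print(multi_head_attention(X, X, X, Wq, Wk, Wv, Wo).shape)  # (5, 8)
```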

The cross-entropy loss is

$$Loss = -\sum_{i=1}^{N}\sum_{c=1}^{M} y_{ic}\,\log\left(p_{ic}\right) \tag{4}$$

In the above formula, $N$ is the total number of samples and $M$ is the number of categories: $M = 7$ for the seven-class task and $M = 2$ for binary classification. $y_{ic}$ is an indicator function with value 0 or 1, taking 1 when the true class of sample $i$ is $c$ and 0 otherwise; $p_{ic}$ is the predicted probability that sample $i$ belongs to class $c$. The prediction is

$$p = \mathrm{Softmax}(Wx + b) \tag{5}$$

where $W$ is the weight matrix, $b$ is the bias, and $p$ is the probability distribution over emotion categories.
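A minimal TensorFlow sketch of the classification head described by formulas (4) and (5); the hidden dimension and layer configuration are assumptions for illustration, not the paper's settings.

```python
import tensorflow as tf

# Output layer: p = Softmax(Wx + b), trained with cross-entropy (formula (4)).
hidden_dim, num_classes = 768, 7   # illustrative values, not the paper's

head = tf.keras.Sequential([
    tf.keras.Input(shape=(hidden_dim,)),                       # pooled sentence vector
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # formula (5)
])
head.compile(optimizer="adam",
             loss="categorical_crossentropy",  # formula (4) with one-hot y_ic
             metrics=["accuracy"])
head.summary()
```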

Experiment
All experiments were run on an NVIDIA Tesla P100 GPU, using Python 3.7 and TensorFlow 2.2.0 for training.

Experimental data
Two Mongolian datasets were used in the experiments. Dataset 1 contains 50,134 items across seven emotion categories: 'happy', 'like', 'surprise', 'sad', 'angry', 'disgust', and 'fear'. Dataset 2 contains 25,371 items in two sentiment categories, positive and negative. In all experiments, 80% of the data in each category of each dataset is used as the training set and the remaining 20% as the test set. Accuracy, precision, recall, and the F1 value are adopted as the evaluation metrics of the model.
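For reference, a minimal sketch of the per-category 80/20 split and the four evaluation metrics. The use of scikit-learn and the toy placeholder data are assumptions; the paper does not state its tooling.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy placeholders standing in for the Mongolian datasets (hypothetical).
texts  = [f"sentence {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]           # 0 = negative, 1 = positive

# Stratified split keeps the 80/20 ratio within each category.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Given model predictions y_pred, compute the four metrics used in the paper.
y_pred = y_test                                 # dummy predictions for illustration
acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro")
print(acc, prec, rec, f1)
```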

Experimental analysis and results
A vocabulary is an important basis for word segmentation and text vectorization, so its quality directly affects the classification performance of the model. To verify the impact of the vocabulary on classification performance, this paper builds vocabularies from three corpora of 200,000, 500,000, and 1.3 million entries. First, the data is preprocessed to remove non-text characters. Second, preliminary word segmentation is performed on the corpus to obtain a basic vocabulary, which is then optimized using word frequency information. Finally, evaluation values of the classification results, such as accuracy and loss, are compared to verify the impact of the vocabulary on the model's text classification performance. Figure 2 shows that the larger the training corpus, the higher the model's classification accuracy; the vocabulary trained on the 1.3-million-entry corpus yields the highest classification accuracy. The likely reason is that a vocabulary trained on a small amount of data is more affected by factors such as text topics, sample balance, and the size of the sentiment data sample; as the data sample grows, the influence of such irrelevant factors gradually decreases, the vocabulary covers text information more comprehensively and reliably, and the model's text classification ability becomes stronger. In addition, the models based on the different vocabularies all stabilize after 3 epochs, indicating that the model has strong feature-learning ability and performs well on Mongolian text classification.
To further verify the effectiveness of the proposed model, this paper compares it experimentally with classic text sentiment classification models on the seven-class and binary classification tasks.
The five classic text classification models are as follows. FastText. An efficient word vector and text classification method proposed by Facebook; its structure is similar to the CBOW (Continuous Bag of Words) model and consists of an input layer, a hidden layer, and an output layer, where the output layer produces the classification target.
LSTM. A gating mechanism lets the LSTM selectively retain information from earlier text, thereby mitigating gradient explosion and vanishing and enabling the model to process longer text.
BiLSTM. Through a bidirectional LSTM, the model can capture more key information in the text (see the sketch after this list). CNN. Text features are extracted by convolutional filters and then classified. TextATTBIRNN. An optimization of the bidirectional LSTM text classifier that introduces an attention mechanism; for the representation produced by the bidirectional LSTM encoder, the mechanism extracts the information most relevant to the decision. The comparison results show that on the seven-class task the traditional Mongolian BERT model improves classification accuracy by 1.54%-4.22% over the classic text classification models; on binary classification, the accuracy of the traditional Mongolian sentiment analysis model is 1.88%-3.50% higher than that of the other models. The experimental results show that the traditional Mongolian BERT model integrated with the Mongolian sentiment dictionary clearly improves sentiment classification performance.
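For reference, a minimal Keras sketch of the BiLSTM baseline; all hyperparameters (vocabulary size, embedding dimension, LSTM units) are illustrative, not those used in the paper's experiments.

```python
import tensorflow as tf

# Sketch of the BiLSTM baseline classifier (illustrative hyperparameters).
vocab_size, embed_dim, max_len, num_classes = 30000, 128, 128, 7

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),                      # token-id sequence
    tf.keras.layers.Embedding(vocab_size, embed_dim),      # word vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # forward + backward context
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```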

Conclusions
In this paper, traditional Mongolian sentiment classification incorporating a sentiment dictionary was investigated. A regularized word segmentation method adapted to Mongolian characteristics is adopted, and the semantic information of the sentiment dictionary is fused via the attention mechanism, which further improves the training efficiency of the model. In the experiments, we first verified the effect of vocabularies trained on corpora of different sizes on classification performance. Then, the models were compared experimentally on two different sentiment datasets, and the results show that the proposed model achieves higher accuracy: 91% on the seven-class task and 96.60% on the binary task.
Although enhancing the semantic information of sentiment words with a fused sentiment dictionary can, to a certain extent, alleviate the insufficient data resources and low classification efficiency of traditional Mongolian sentiment classification, constructing a Mongolian sentiment dictionary requires a great deal of time and effort. Therefore, it is necessary to continue optimizing data augmentation methods to improve the quality of Mongolian sentiment analysis corpora and the efficiency of the model.

Fig. 2. Comparison chart of the effect of vocabulary on model accuracy.

Traditional Mongolian sentiment analysis model incorporating prior knowledge

BERT

BERT (Bidirectional Encoder Representation from Transformers) is a pre-trained language model based on a deep Transformer, proposed by Devlin et al. in 2018. As soon as it was released, it refreshed the performance records of many NLP tasks, and its excellent results once again stimulated research enthusiasm for pre-trained models and provided an important reference for subsequent pre-trained language model frameworks. This paper improves the BERT model and integrates the Mongolian sentiment dictionary into the sentiment analysis model, thereby improving the model's efficiency in extracting sentiment features.

Table 3. Dataset information.

Table 5. Accuracy comparison of different methods on the two classification tasks.