The pipeline processing of NLP

An NLP problem is typically divided into several smaller parts and solved step by step. This article considers all forms of text processing that are required at each stage of solving such a problem. This step-by-step text processing is called a pipeline in NLP: a sequence of steps that must be carried out when creating any NLP model. Planning and developing the text processing is the starting point of any NLP project. The article discusses the steps involved in implementing a pipeline and their role in solving NLP tasks, and analyzes the most common preprocessing steps in the NLP pipeline. All of these processing stages are available in various NLP libraries as pre-trained, ready-to-use models. If necessary, additional or modified preprocessing steps can be developed depending on the problem at hand. How well a particular preprocessing stage serves a given NLP problem can be determined through repeated experimentation.


Introduction
Typically, an NLP problem is divided into several smaller parts and solved step by step. This article considers all forms of text processing that are required at each stage of solving such a problem. This step-by-step text processing is called a pipeline in NLP [1][2][3][4]. When creating any NLP model, the pipeline is the sequence of steps that must be carried out. Planning and developing the text processing is the starting point of any NLP project. This article discusses the steps involved in implementing a pipeline and their role in solving NLP tasks. Figure 1 below shows the main components of a common pipeline for developing a modern NLP system [5].
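1. Data collection.
2. Text cleaning.
3. Preliminary processing.
4. Feature development.
5. Modeling.
6. Evaluation.
7. Implementation of the model.
8. Monitoring and model updating.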

Main part
The first step in the development of any NLP system is collecting data relevant to the given task. Even a rule-based NLP system needs training (test) data on which to design and examine the rules. The data we receive is rarely "clean", so the next step is text cleaning. Once the text has been cleaned, it needs to be converted to a canonical form. This is done at the preliminary processing stage [6][7][8][9][10]. After that, the features needed to perform the NLP task are developed and converted into a format that modeling algorithms can understand. In the next step, modeling and evaluation are performed: one or more language models are created and compared using appropriate evaluation metrics. After the best of the developed models has been selected, it is put into use. Finally, the performance of the model is monitored continuously, and the model is improved and updated when necessary. In the primary stages, a lot of time is spent on feature development, modeling, and evaluation [6][7][11][12][13][14][15].
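As a rough illustration of a pipeline as a fixed sequence of stages, the first few stages described above can be sketched as plain Python functions applied in order. The function names, the markup-stripping rule, and the bag-of-words features are assumptions made for this sketch; they are not taken from any particular library or from the authors' system.

```python
import re
from collections import Counter


def clean_text(raw):
    """Text cleaning: drop markup-like residue and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", without_tags).strip()


def preprocess(text):
    """Preliminary processing: lowercase and split into word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def extract_features(tokens):
    """Feature development: a simple bag-of-words count."""
    return Counter(tokens)


# The pipeline is just the ordered list of stages.
PIPELINE = [clean_text, preprocess, extract_features]


def run_pipeline(raw, steps=PIPELINE):
    """Feed the output of each stage into the next one."""
    data = raw
    for step in steps:
        data = step(data)
    return data


features = run_pipeline("<p>NLP  pipeline, NLP steps</p>")
print(features)
```

Keeping the stages in a list makes it easy to add, remove, or reorder steps while experimenting, which is how the later modeling and evaluation stages would be attached.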
Primary stages. As mentioned above, NLP software typically analyzes text by segmenting it into sentences and words (tokens). Thus, any NLP pipeline needs a component that correctly segments text into sentences (sentence segmentation) and then segments the sentences into words (word tokenization).
Sentence segmentation. As a general rule, text can be segmented into sentences wherever periods and question marks appear. However, abbreviations and address forms (sh.k. "according to", va b. "etc.") or ellipses (...) can break this rule. Existing NLP libraries for Python provide standard methods for separating sentences and words. The following Python code shows how to use the sentence and word tokenizers from the Natural Language Toolkit (NLTK) library:

    from nltk.tokenize import sent_tokenize, word_tokenize

    mytext = "Typically, an NLP problem should be solved step by step, breaking it down into several small parts. This article examines all the forms of text processing required at each step of solving an NLP problem."
    my_sentences = sent_tokenize(mytext)

Word tokenization. The main unit of a language is its vocabulary (lexicon). But what does the vocabulary contain? The term "word" has a broad meaning, and it must be clarified before it can be used in a scientific manner. If we separate words from the text using separators (spaces, some punctuation marks, etc.), many tokens are generated [16][17]. A token is any unit separated from the text by a boundary. There are 4 tokens in the sentence "Daraxt bir yerda ko'karadi": daraxt, bir, yerda, ko'karadi. There are also 4 tokens in the sentence "Halol mehnat yerda qolmas": halol, mehnat, yerda, qolmas.
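The two rules described above can be sketched without any library: sentences are closed at a period, question mark, or exclamation mark unless the chunk is a known abbreviation or ends in an ellipsis, and tokens are whatever remains between separators. The abbreviation list and the function names are assumptions made for this illustration; a real system would need a curated list for each language.

```python
import re

# Hypothetical abbreviation list used only for this sketch.
ABBREVIATIONS = {"dr.", "mr.", "sh.k."}


def split_sentences(text):
    """Close a sentence at . ? ! unless the chunk is an abbreviation or ellipsis."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        ends_sentence = chunk.endswith((".", "?", "!"))
        is_exception = chunk.lower() in ABBREVIATIONS or chunk.endswith("...")
        if ends_sentence and not is_exception:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing mark
        sentences.append(" ".join(current))
    return sentences


def split_tokens(sentence):
    """Treat whitespace and common punctuation marks as separators."""
    return re.findall(r"[^\s.,;:!?]+", sentence)


print(split_sentences("Dr. Karimov arrived. Did he stay? Yes."))
print(split_tokens("Daraxt bir yerda ko'karadi"))
```

Because the apostrophe is not treated as a separator, "ko'karadi" stays a single token, which matches the count of four tokens given in the text.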
To split sentences into words, as with sentence segmentation, one can start with a simple rule that segments text into words wherever punctuation marks appear. The NLTK library allows one to do this:

    for sentence in my_sentences:
        print(sentence)
        print(word_tokenize(sentence))

It should be remembered that most existing tokenizers do not segment a given text into tokens with 100% accuracy; these algorithms are not perfect. For example, consider the following sentence: " [:] are defined as separate tokens. Also, if one wants to tokenize tweets, the tokenizer splits a hashtag into two tokens: the "#" sign and the sequence after it. In such cases, one may need a custom tokenizer built for one's own purpose. Therefore, a tokenizer sometimes has to perform some elements of graphematic analysis, distinguishing:
• sequences of letters;
• numbers;
• punctuation marks;
• hieroglyphs;
• separators;
• various graphic symbols.
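A custom tokenizer with a few elements of graphematic analysis could look like the sketch below. It labels each token with one of the classes listed above and keeps a hashtag as a single token. The class names, the regular expressions, and the restriction of "hieroglyphs" to the basic CJK Unified Ideographs block are assumptions made for this illustration; separators (whitespace) are skipped rather than emitted.

```python
import re

# Order matters: earlier alternatives win when several could match.
TOKEN_PATTERN = re.compile(
    r"(?P<hashtag>#\w+)"
    r"|(?P<hieroglyph>[\u4e00-\u9fff]+)"
    r"|(?P<letters>[^\W\d_]+(?:['’ʻ][^\W\d_]+)*)"
    r"|(?P<number>\d+(?:[.,]\d+)*)"
    r"""|(?P<punctuation>[.,;:!?()"'-])"""
    r"|(?P<symbol>[^\w\s])"
)


def graphematic_tokens(text):
    """Return (token, class) pairs; whitespace acts only as a separator."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]


for token, kind in graphematic_tokens("#NLP costs 25 $!"):
    print(f"{token!r:10} {kind}")
```

Putting the hashtag rule first is what prevents "#NLP" from being split into "#" and "NLP"; other domain-specific units (user mentions, URLs) could be added in the same way.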

English version tokenization process of sentences
Uzbek version tokenization process of sentences

Steps to be taken regularly. Let's take a look at the preprocessing steps that are performed regularly in the NLP pipeline. Suppose one is developing software that classifies an article on a news site into one of four categories: politics, sports, business, and other topics. Assume that a program for dividing text into sentences and tokens is already available. In this case, one needs to start thinking about what kind of data is useful for developing a tool that separates news into these groups.
Some frequently used words in the Uzbek language, for example albatta, ammo, asosan, aynan, balki, barcha, bilan, biroq, are not needed for this task because, by themselves, they carry no information that distinguishes the four categories. Such words are called non-essential words (stop words) and are usually removed from further analysis. However, there is no standard list of non-essential words for the Uzbek language. Some of the Uzbek non-essential words compiled by the authors are presented in Table 1. Some NLP packages provide lists of non-essential words (for some foreign languages), and in many cases these lists vary depending on the given problem.
Removing punctuation and/or numbers is a common step for many NLP problems such as text classification, data mining, and social media analysis. The following program code shows how to remove non-essential words, numbers, and punctuation marks from a given collection of texts and convert the remaining tokens to lowercase:

    from string import punctuation

    from nltk.tokenize import word_tokenize
    from uzcorpus import stopwords

    def preprocess_corpus(texts):
        mystopwords = set(stopwords.words("uzbek"))

        def remove_stops_digits(tokens):
            # Keep lowercased tokens that are not stop words, digits, or punctuation
            return [token.lower() for token in tokens
                    if token not in mystopwords and not token.isdigit() and token not in punctuation]

        return [remove_stops_digits(word_tokenize(text)) for text in texts]

It should be noted that these four processes are not mandatory, nor do they have to be performed in this order for all NLP problems. The function above shows how the text processing steps can be implemented in an NLP project; text data is first passed through this function. Two commonly performed preprocessing steps that take word-level features into account are stemming and lemmatization.
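For readers without access to a stop-word package for Uzbek, the same filtering can be sketched with only the standard library. The stop-word set below contains just the eight example words quoted earlier and is an assumption for illustration, not the authors' full list; the tokenizer is a simple regular expression rather than NLTK's. Unlike the function above, this sketch lowercases each token before the stop-word check so that sentence-initial words such as "Ammo" are also removed.

```python
import re
from string import punctuation

# Illustrative subset only; the authors' full list is not reproduced here.
UZBEK_STOPWORDS = {
    "albatta", "ammo", "asosan", "aynan", "balki", "barcha", "bilan", "biroq",
}


def tokenize(text):
    """Words (with inner apostrophes), digit runs, or single punctuation marks."""
    return re.findall(r"[^\W\d_]+(?:['’ʻ][^\W\d_]+)*|\d+|[^\w\s]", text)


def preprocess_text(text):
    """Drop stop words, digits, and punctuation; lowercase what remains."""
    kept = []
    for token in tokenize(text):
        lowered = token.lower()
        if lowered in UZBEK_STOPWORDS or token.isdigit() or token in punctuation:
            continue
        kept.append(lowered)
    return kept


print(preprocess_text("Ammo daraxt 2 bilan ko'karadi!"))
```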

Results
Stemming and lemmatization. In stemming, suffixes are removed from a word and the word is reduced to a basic form (a word form without suffixes).
Each token has an initial (or normal) form, called a lemma. In speech (text), this initial form appears in various grammatical forms (inflection may also occur). The following figure shows the stemming stages of two sentences.

English version stemming process of sentences
Uzbek version stemming process of sentences

At this point, we need to clarify the term word form. A word form is a form of a lemma (lexeme) used in speech. The lemma "daraxt" can take various grammatical forms of the noun and produce many word forms:
• daraxt, daraxtni, daraxtning, daraxtga, daraxtdan;
• daraxtlar, daraxtlarni, daraxtlarning, daraxtlarga, daraxtlardan;
• daraxtim, daraxting, daraxti, daraxtimiz, daraxtingiz, daraxtlari.
In the same way, other lemmas also appear as word forms when they are used in speech. The usual structure of a word in the Uzbek language is as follows: "Stem + word-builder + lexical form + syntactic form". The order in which grammatical devices are attached depends on their meaning and grammatical features, and follows this sequence: 1) a device that forms a new lexical meaning; 2) a device that influences the lexical meaning; 3) a device that does not affect the lexical meaning but connects the word to other words.
In the Uzbek language, affix morphemes are added consecutively after the stem. The rules of morphotactics determine the order in which morphemes and allomorphs are arranged in a word form. Based on these rules, the table below shows how some morphemes of the Uzbek language can combine.

The table shows that suffixes such as -lar, -im, -ing, -imiz, -si, -ni are added directly to the stem; they are placed to the left of other affix morphemes. The suffixes -man, -san, -miz, -siz cannot be added directly to the stem: another morpheme (for example, a tense suffix) is first added to the stem, and this group of suffixes, which is characteristically located to the right of other affix morphemes, occupies the next position. So, in the Uzbek language, the minimal arrangement of the stem and affixes is as follows:
• prefix (word-builder);
• a word-builder added after a word;
• syntactic form builder;
• lexical form-building adverbs.
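Because affixes are attached to the stem in a fixed left-to-right order, a naive stemmer can remove them from right to left in the opposite order. The sketch below does this for the noun forms of "daraxt" listed earlier. The suffix groups are a small illustrative subset chosen for this example (they are not the authors' morphoanalyzer and do not cover, for instance, -si or the verbal suffixes), and such blind stripping will over-stem words whose stems happen to end in these letter sequences.

```python
# Illustrative suffix groups, each ordered longest first.
CASE_SUFFIXES = ("ning", "dan", "ni", "ga")
POSSESSIVE_SUFFIXES = ("ingiz", "imiz", "ing", "im", "i")
PLURAL_SUFFIXES = ("lar",)

# Strip from the right edge inward: case, then possessive, then plural.
STRIP_ORDER = (CASE_SUFFIXES, POSSESSIVE_SUFFIXES, PLURAL_SUFFIXES)


def strip_one(word, suffixes, min_stem=3):
    """Remove at most one suffix from the group, keeping a minimal stem length."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Naive suffix stripping for noun forms; not a full morphological analysis."""
    word = word.lower()
    for group in STRIP_ORDER:
        word = strip_one(word, group)
    return word


for form in ("daraxtlarning", "daraxtingiz", "daraxtlari", "daraxtdan"):
    print(form, "->", stem(form))
```

Checking the groups longest-suffix-first avoids, for example, removing only -i from "daraxtingiz" when -ingiz applies.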
The stem carries the main lexical meaning of the word. Suffixes give additional meaning to the word. Separating a word into morphemes is called morphemic analysis (Figure 4). For example:

Fig. 4. Morphemic analysis
Figure 5 below presents the algorithm for determining the content of words.

Lemmatization is the process of mapping all the different forms of a word to its base word, or lemma. Although this seems similar to stemming, the two are actually different. The lemmatization process is shown below:

Sumbula sprang to his feet and gasped: "Thank you, I'm having a dream."
Fig. 6. Examples of the lemmatization process

Stemming and lemmatization should be distinguished. In the Uzbek language their results often look the same, but they are completely different processes: stemming removes suffixes from a word form to obtain the stem, while lemmatization determines the dictionary form of the word. Below are examples of the analysis of the stemming and lemmatization process:

Lemmatization requires more linguistic knowledge, and the modeling and development of efficient lemmatizers for Uzbek-language texts is still an open problem in NLP research.
Since lemmatization involves a certain amount of linguistic analysis of the word and its context, it is more time-consuming than stemming, and it is usually used only when necessary. Which lemmatizer or stemmer to use in the preprocessing stages of the NLP pipeline is chosen according to the requirements of the given problem.
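The difference between the two processes can be made concrete with a toy example. In the sketch below, the stemmer only cuts characters off the end of a word, while the lemmatizer looks the word up in a dictionary. The English suffix list and the four-entry lexicon are hypothetical and exist only for this illustration; a real lemmatizer would also use the part of speech and the context of the word.

```python
# Hypothetical toy resources used only for this comparison.
SUFFIXES = ("ing", "ies", "es", "ed", "s")
LEMMA_LEXICON = {
    "went": "go",
    "better": "good",
    "mice": "mouse",
    "studies": "study",
}


def toy_stem(word, min_stem=3):
    """Stemming: cut off the first matching suffix, with no dictionary check."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Lemmatization: look up the dictionary form; fall back to the word itself."""
    word = word.lower()
    return LEMMA_LEXICON.get(word, word)


for word in ("studies", "went", "better"):
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word)}")
```

For "studies" the stemmer yields the non-word "stud" while the lexicon yields "study", and for "went" only the lexicon can recover "go", which is why lemmatization needs linguistic resources that stemming does not.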
The steps of removing non-essential words, numbers, and punctuation and of lowercasing the text are not always all necessary, and their order is not always fixed. For example, if numbers and punctuation marks both have to be removed from the text, which of them is removed first may not matter much. However, uppercase letters are usually replaced with lowercase letters before stemming is performed. Before lemmatization, on the other hand, tokens are not removed and the text is not lowercased, because obtaining the lemma requires knowing the part of speech, and this requires that all tokens in the sentence are intact. Once one has a clear understanding of how to process the data, it is a good idea to prepare a step-by-step list of the preprocessing steps to be taken. Examples of the tokenization, stemming, and lemmatization steps of the NLP pipeline in the Uzbek language morphoanalyzer developed by the authors (http://uznatcorpara.uz/), together with the ER model of its database, are presented below:

Additional stages of primary processing. A few common preprocessing steps in the NLP pipeline have been covered above. Although the nature of the texts was not stated explicitly, it was assumed that we were working with plain English text. Some additional preprocessing steps are given below.
Text normalization. Let's look at the task of identifying news in social media posts. Social media text is quite different from the language used in newspapers. Words can be written in different ways, for example in abbreviated forms; phone numbers can be written in different formats; names are sometimes written in lowercase letters, and so on. When developing NLP tools to work with such data, we need a canonical form of the text that covers all of these variations. This process is called text normalization. Some common normalization steps are converting the text to all lowercase or all uppercase, converting numbers to text (e.g. 9 to nine), expanding abbreviations, etc. A simple method of text normalization is provided in the spaCy package.
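The normalization steps named above (lowercasing, converting digits to words, expanding abbreviations, and unifying phone-number formats) can be sketched with the standard library as follows. This is not spaCy's method; the abbreviation dictionary, the phone-number pattern, and the choice of a digits-only canonical form are assumptions made for this illustration.

```python
import re

# Hypothetical lookup tables for this sketch.
ABBREVIATIONS = {"pls": "please", "thx": "thanks", "u": "you"}
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# A digit, at least six digits/spaces/brackets/dots/hyphens, then a final digit.
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def normalize(text):
    """Lowercase, canonicalize phone numbers, expand abbreviations and digits."""
    text = text.lower()
    # Canonical phone form: digits only, regardless of the original layout.
    text = PHONE_PATTERN.sub(lambda m: re.sub(r"\D", "", m.group()), text)
    tokens = []
    for token in text.split():
        if token in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[token])
        elif token in DIGIT_WORDS:
            tokens.append(DIGIT_WORDS[token])
        else:
            tokens.append(token)
    return " ".join(tokens)


print(normalize("Pls call 9 times: +998 (71) 123-45-67"))
```

The phone numbers are rewritten before the text is split on whitespace, since a number such as "+998 (71) 123-45-67" would otherwise be broken into several unrelated tokens.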
Identifying the language. A large share of web content is written in languages other than English. For example, suppose one is asked to collect all the reviews of a product on the internet. When scanning product pages on various e-commerce websites, one comes across a few non-English reviews. Since most of the NLP pipeline is built with language-specific tools, what should be done with a pipeline that expects English text? In such cases, language identification is performed as the first step of the NLP pipeline. One can use NLP packages such as Polyglot for language identification. After this operation, the next steps of the pipeline can be made language specific.
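To show where language identification sits in the pipeline, the sketch below uses a deliberately simple heuristic: it counts how many frequent function words of each language occur in the text. This is not how Polyglot works; the two small word profiles and the function name are assumptions for illustration, and a production system would use a trained language identifier.

```python
import re

# Tiny illustrative profiles of frequent function words per language.
STOPWORD_PROFILES = {
    "en": {"the", "and", "is", "of", "to", "in", "it", "this"},
    "uz": {"va", "bilan", "ammo", "uchun", "bu", "ham", "biroq", "albatta"},
}


def identify_language(text, default="unknown"):
    """Pick the language whose function words overlap most with the text."""
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    scores = {
        lang: len(tokens & words) for lang, words in STOPWORD_PROFILES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


reviews = [
    "This product is great and it works",
    "Bu mahsulot juda yaxshi va arzon",
]
for review in reviews:
    print(identify_language(review), "|", review)
```

The returned code can then be used to route each review to the tokenizer, stop-word list, and stemmer of the matching language.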
Many people around the world speak more than one language in their daily lives, and so they use several languages in their posts on social networks. As an example of code mixing, Figure 5 shows a Singlish (Singapore slang + English) phrase from LDC. This single phrase contains words from Tamil, English, Malay, and three Chinese languages. Code mixing refers to this phenomenon of switching between languages. When people use more than one language in their writing, they often write the words of those languages in Latin script, with English spelling, so that words from other languages appear alongside the English text. This is known as transliteration. Both phenomena are common in multilingual communities and should be taken into account during text preprocessing. The general preprocessing steps have been discussed above. Although this list is not complete, we hope it gives an idea of the different preprocessing methods that may be required depending on the nature of the data set.

Extended processing
Consider the task of developing a system that identifies the names of individuals and organizations in a company's collection of one million documents. The common text processing steps discussed earlier may not be appropriate in this context. Identifying names requires POS tagging, because identifying proper nouns helps to determine the names of individuals and organizations. How is POS tagging done during the preprocessing stage of the project? Pre-trained, easy-to-use POS taggers are available in NLP libraries such as NLTK, spaCy, and Parsey McParseface Tagger, so one typically does not need to develop one's own POS tagging solution. It is important to note that, for the same preprocessing step, the results from different NLP libraries may differ because of differences in implementation and algorithms. Which library (or libraries) to use in an NLP project depends on the context of the given problem.

Let's look at a slightly different problem: in addition to identifying the names of individuals and organizations in the company's collection of documents, we are also asked to determine whether a particular individual and organization are somehow related. To do this, we need a method for identifying patterns that indicate a "relationship" between two entities in a sentence. This requires some form of syntactic representation of the sentence. In addition, we need a method for identifying and linking multiple references to the same entity.
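How POS tags help with name identification can be sketched as follows: once every token has a tag, consecutive proper-noun tokens are grouped into candidate names. The tags in the example are written by hand using Penn Treebank labels (NNP, NNPS) and the function name is hypothetical; in practice the tags would come from a pre-trained tagger in one of the libraries mentioned above, and deciding whether a candidate is a person or an organization would need a further step.

```python
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def proper_noun_spans(tagged_tokens):
    """Group runs of consecutive proper-noun tokens into candidate names."""
    spans, current = [], []
    for word, tag in tagged_tokens:
        if tag in PROPER_NOUN_TAGS:
            current.append(word)
        elif current:  # a non-proper-noun token closes the current run
            spans.append(" ".join(current))
            current = []
    if current:  # a run that reaches the end of the sentence
        spans.append(" ".join(current))
    return spans


# Hand-tagged example sentence (tags are assumptions, not tagger output).
tagged = [
    ("Alisher", "NNP"), ("Navoi", "NNP"), ("visited", "VBD"),
    ("Tashkent", "NNP"), ("University", "NNP"), (".", "."),
]
print(proper_noun_spans(tagged))
```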

Conclusion
This article analyzed the most common preprocessing steps in the NLP pipeline. All of these processing steps are available in various NLP libraries as pre-trained, ready-to-use models. If necessary, additional, customized preprocessing steps can be developed depending on the problem at hand. How well a particular preprocessing stage serves a given NLP problem can be determined through repeated experimentation. The tokenization, stemming, and lemmatization stages of the NLP pipeline were developed and implemented in the morphoanalyzer of the Uzbek language created by the scientists of Tashkent State University of Uzbek Language and Literature named after Alisher Navoi.