Methods of eliminating homonymy within different, grammatically similar word groups

The problem of automatic natural language processing has remained relevant for more than half a century. One of the important problems in the field of NLP is the creation of a semantic analyzer, which in turn involves a number of steps. Resolving homonymy is an important part of the semantic analysis of sentences. Rule-based methods, methods based on statistical data, and methods based on machine learning are used to resolve homonymy. Statistical methods are mainly used to resolve homonymy between grammatically similar word groups. This article discusses the resolution of homonymy between two grammatically similar word groups, nouns and adjectives, using statistical methods, namely the frequency and Bayesian methods. The Bayesian method uses bigrams and trigrams, whereas the frequency method classifies the characteristics of the word groups and determines the parameters that can distinguish them. The identified parameters are converted into numbers on the basis of observations, and probabilistic decisions are made.


Introduction
The problem of automatic natural language processing has remained relevant for more than half a century. The complexity of the problem and the lack of a clear approach show how difficult it is to solve. Linguistic analyzers are particularly important as tools for the automatic processing of sentences. They are divided into morphological, syntactic and semantic analyzers.
The phenomenon of homonymy is one of the important elements of the semantic analyzer. Homonymy detection is approached differently in different natural languages. In computational linguistics worldwide, three methods are mainly used in the semantic analysis of sentences: the rule-based method; the method based on statistical data; and the method based on machine learning.
These methods are used differently in different languages. For example, in Russian linguistics there are many studies devoted to homonyms.

Experimental part
The phenomenon of homonymy, in turn, is studied by dividing it into groups such as lexical, morphological and phraseological homonymy. This article describes the methods used to eliminate lexical homonymy. Having studied foreign experience in depth, we use rule-based, statistical and machine learning-based methods to distinguish homonyms in the Uzbek language. We divided Uzbek homonyms according to their occurrence across word groups: homonyms within one word group, two word groups, three word groups, and four word groups. Rule-based and statistical methods are recommended for distinguishing homonyms between different word groups. At this point, the question arises: "When is a rule-based method preferable, and when is a statistical method convenient?" Statistical methods require a natural language corpus with a large database in which all texts are tagged. We used the rule-based method to determine homonymy within grammatically dissimilar word groups [2,3,4]. There are also homonyms that span different but grammatically similar word groups.
The Uzbek language also has its own national corpus, and the database of this corpus contains billions of texts. Therefore, statistical methods can be used. For this, tagging the corpus data, i.e. language modeling [1], remains a key issue.
In this article, we explain how statistical methods are used to distinguish homonyms within the word groups presented in Figure 2.
Among the homonyms, there are words that belong to different word families and take the same affixes. After a suffix is added, however, the components of such a homonym may belong to different word groups, or may belong to different hyponyms. In such situations, the trigram Hidden Markov Model (HMM) is used to distinguish homonyms. To use the trigram HMM, the tags of the words in the sentence must also be determined. A trigram HMM consists of a finite set $V$ of possible words and a finite set $K$ of possible tags for these words, together with the following parameters. The parameter $q(s \mid u, v)$, for each trigram $(u, v, s)$ with $s \in K \cup \{\mathrm{STOP}\}$ and $u, v \in K \cup \{*\}$, is the probability that tag $s$ follows the tag bigram $(u, v)$. The parameter $e(x \mid s)$, for each $x \in V$, $s \in K$, is the probability that word $x$ is paired with tag $s$. Let $S$ be the set of all pairs $(x_1, \ldots, x_n, y_1, \ldots, y_{n+1})$ of word sequences and tag sequences, where $n \ge 0$, $x_i \in V$ for $i = 1 \ldots n$, $y_i \in K$ for $i = 1 \ldots n$, and $y_{n+1} = \mathrm{STOP}$. For each $(x_1, \ldots, x_n, y_1, \ldots, y_{n+1}) \in S$ we need to determine the probability

$$p(x_1, \ldots, x_n, y_1, \ldots, y_{n+1}) = \prod_{i=1}^{n+1} q(y_i \mid y_{i-2}, y_{i-1}) \prod_{i=1}^{n} e(x_i \mid y_i) \qquad (1)$$

Here $y_0 = y_{-1} = *$. For example, let $n = 5$ and let $x_1 \ldots x_5$ be the sentence "Buvijonning alomat gaplari bor a?". To apply the joint probability to this sentence, its words are divided into stems and suffixes. When the sentence is tagged, two different tag sequences can be formed: N Adj N STOP and N N N STOP. The joint probability for each tag sequence $y_1 \ldots y_6$ is calculated with formula (1). This model is a noisy-channel model: the joint probability factors into $p(y_1 \ldots y_{n+1})$ and $p(x_1 \ldots x_n \mid y_1 \ldots y_{n+1})$. We use the second-order Markov model (trigram model) to calculate the value of $p(y_1 \ldots y_{n+1})$.
The factor $p(x_1 \ldots x_n \mid y_1 \ldots y_{n+1})$ denotes the conditional probability of the words $x$ taken from the sentence "Buvijonning alomat gaplari bor a?" given the tags $y$ taken from the sequence N Adj N STOP. To calculate this probability, we need the model parameters, which we now estimate. A sample set $X_1 \ldots X_n$ is given to the analyzer. For each sentence, a sequence of words $x_1 \ldots x_n$ and tags $y_1 \ldots y_n$ is defined. How do we estimate the parameters of the model given this information? There is a simple and very intuitive answer to this question. Let $c(u, v, s)$ be the number of times the tag sequence $(u, v, s)$ occurs in the given data; for example, $c(\mathrm{N}, \mathrm{Adj}, \mathrm{V})$, where Adj is the adjective reading of the homonym alomat in the given sentence, is the number of occurrences in which this tag is preceded by the tag N and followed by the tag V. Similarly, $c(u, v)$ is the number of times the tag bigram $(u, v)$ occurs, and $c(s)$ is the number of times the tag $s$ is seen in the given corpus. Finally, $c(s \leadsto x)$ is the number of occurrences of the word $x$ with the tag $s$ in the corpus: for example, $c(\mathrm{Adj} \leadsto \text{alomat})$ is the number of occurrences of the word alomat in the corpus with the Adj (adjective) tag.
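The trigram HMM joint probability described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `joint_probability`, the dictionary layout of `q` and `e`, and all numeric values are assumptions made up for the example, and the sentence is restricted to its first three words so that it matches the three-tag sequences N Adj N STOP and N N N STOP.

```python
from math import prod

def joint_probability(words, tags, q, e):
    """Joint probability of a word sequence and a tag sequence under a trigram HMM.

    words: list of n words x_1..x_n
    tags:  list of n tags y_1..y_n (STOP is appended inside the function)
    q:     dict mapping (u, v, s) -> q(s | u, v)
    e:     dict mapping (s, x)    -> e(x | s)
    Parameters missing from the dictionaries are treated as probability 0.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    # Pad with two start symbols and the final STOP tag.
    padded = ["*", "*"] + list(tags) + ["STOP"]
    transition = prod(
        q.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
        for i in range(2, len(padded))
    )
    emission = prod(e.get((t, w), 0.0) for w, t in zip(words, tags))
    return transition * emission

# Hypothetical parameter values, chosen only to make the arithmetic easy to follow.
q = {
    ("*", "*", "N"): 1.0,
    ("*", "N", "Adj"): 0.5,
    ("*", "N", "N"): 0.5,
    ("N", "Adj", "N"): 1.0,
    ("N", "N", "N"): 0.5,
    ("Adj", "N", "STOP"): 1.0,
    ("N", "N", "STOP"): 0.5,
}
e = {
    ("N", "Buvijonning"): 0.25,
    ("N", "gaplari"): 0.25,
    ("N", "alomat"): 0.5,
    ("Adj", "alomat"): 1.0,
}

words = ["Buvijonning", "alomat", "gaplari"]
p_adj = joint_probability(words, ["N", "Adj", "N"], q, e)
p_noun = joint_probability(words, ["N", "N", "N"], q, e)
print(p_adj, p_noun)  # 0.03125 0.00390625
```

With these invented values the adjective reading scores higher, so a tagger that picks the more probable sequence would label alomat as Adj. Real decisions depend entirely on parameters estimated from a tagged corpus.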
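Taking these definitions into account, the maximum likelihood estimates of the transition parameters $q$ and the emission parameters $e$ are given as follows:

$$q(s \mid u, v) = \frac{c(u, v, s)}{c(u, v)} \quad \text{and} \quad e(x \mid s) = \frac{c(s \leadsto x)}{c(s)}$$

For our example, these probabilities are calculated as follows:

$$q(\mathrm{V} \mid \mathrm{N}, \mathrm{Adj}) = \frac{c(\mathrm{N}, \mathrm{Adj}, \mathrm{V})}{c(\mathrm{N}, \mathrm{Adj})} \quad \text{and} \quad e(\text{alomat} \mid \mathrm{Adj}) = \frac{c(\mathrm{Adj} \leadsto \text{alomat})}{c(\mathrm{Adj})}$$

Thus, to estimate the parameters of the model, it is enough to read the counts from a tagged language corpus and compute the maximum likelihood estimates using these formulas. We perform the above calculations for the sentence "Buvijonning alomat gaplari bor a?".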
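The counting step behind these maximum likelihood estimates can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `estimate_parameters` and the three-sentence toy corpus are hypothetical and are not taken from the national corpus. The bigram count is accumulated as the number of times a tag bigram is followed by another tag or by STOP, which is what makes the transition estimates sum to one for each context.

```python
from collections import Counter

def estimate_parameters(tagged_sentences):
    """Maximum likelihood estimates for a trigram HMM from tagged sentences.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (q, e) with q[(u, v, s)] = c(u, v, s) / c(u, v)
    and e[(s, x)] = c(s ~> x) / c(s).
    """
    trigram = Counter()    # c(u, v, s)
    bigram = Counter()     # c(u, v), counted as the context of a following tag
    tag_count = Counter()  # c(s)
    emit = Counter()       # c(s ~> x)
    for sentence in tagged_sentences:
        tags = ["*", "*"] + [tag for _, tag in sentence] + ["STOP"]
        for i in range(2, len(tags)):
            trigram[(tags[i - 2], tags[i - 1], tags[i])] += 1
            bigram[(tags[i - 2], tags[i - 1])] += 1
        for word, tag in sentence:
            tag_count[tag] += 1
            emit[(tag, word)] += 1
    q = {(u, v, s): c / bigram[(u, v)] for (u, v, s), c in trigram.items()}
    e = {(s, x): c / tag_count[s] for (s, x), c in emit.items()}
    return q, e

# Hypothetical hand-tagged toy corpus (not real corpus data).
corpus = [
    [("alomat", "Adj"), ("gap", "N")],
    [("alomat", "N")],
    [("yaxshi", "Adj"), ("alomat", "N")],
]
q, e = estimate_parameters(corpus)
print(q[("*", "*", "Adj")])  # c(*, *, Adj) / c(*, *) = 2 / 3
print(e[("Adj", "alomat")])  # c(Adj ~> alomat) / c(Adj) = 1 / 2
print(e[("N", "alomat")])    # c(N ~> alomat) / c(N) = 2 / 3
```

Unsmoothed estimates like these assign probability zero to any trigram or word and tag pair that was never seen, so a practical tagger would usually add some form of smoothing.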

Results and discussions
A search for the word alomat in the database of the national corpus of the Uzbek language returned 70,358 results. When the first 500 were analyzed and tagged, the following results were obtained: [...] .56 was found to be equal to [...]. From the calculated results, the word alomat in the sentence "Buvijonning alomat gaplari bor a?" is a homonymous word of the noun group and can be considered to mean "Odatdagidan o'zgacha tushunib bo'lmaydigan" (unusual, incomprehensible). This method is called the Bayes method. Another statistical method is the frequency method, which requires the classification of word groups. Let us consider how the above-mentioned word alomat is determined using the frequency method.
Homonyms within the noun ∨ adjective group are classified according to the following parameters: 1. Vocabulary; 2. Stem and lemma; 3. Lemma only; 4. Stem only. Consider a sentence with a homonym from the noun ∨ adjective group:
Bu bemordagi alomatlar covid-19 kasalligini eslatyapti. (The symptoms in this patient are reminiscent of covid-19.) In this sentence, the word alomat is a homonym and has the following meanings. We use the data of the national corpus of the Uzbek language to classify the word alomat according to the above parameters. A search for the word alomat on the Uzschoolcorpora.uz site returned a total of 6823 results, of which 100 were analyzed: 79 of them are nouns and 21 are adjectives. The 100 analyzed results were divided into stems and suffixes, and the following results were determined.
Based on the obtained statistical data, we also divide the homonym in the given sentence into its stem and suffixes. Then, based on the above statistics, the following decision is made for the word alomat in this sentence.
The information in this chart shows that the word alomat in the given sentence is, with a probability of 95%, a homonym of the noun group, and that it means a sign or an indicator. In this way, homonymy can be determined by the frequency method. It can be seen that, to use the frequency method, the classification parameters must be determined for each of the homonyms within different word groups, and statistical calculations must be performed on them.
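The decision step of the frequency method can be illustrated with a small sketch. It assumes, purely for illustration, that each analyzed occurrence has been reduced to one classification feature (here "suffixed" versus "bare" surface form) plus a tag. The feature names, the function `tag_probabilities`, and all counts below are invented and do not reproduce the corpus figures reported above.

```python
from collections import Counter

def tag_probabilities(observations, feature):
    """Relative frequency of each tag among observations that share a feature.

    observations: list of (feature, tag) pairs from hand-tagged corpus hits.
    Returns a dict tag -> estimated probability, or {} if the feature is unseen.
    """
    counts = Counter(tag for feat, tag in observations if feat == feature)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}

# Hypothetical tagged observations (made-up counts, for illustration only).
observations = (
    [("suffixed", "N")] * 8
    + [("suffixed", "Adj")] * 2
    + [("bare", "N")] * 3
    + [("bare", "Adj")] * 7
)

probs = tag_probabilities(observations, "suffixed")
best_tag = max(probs, key=probs.get)
print(probs)     # {'N': 0.8, 'Adj': 0.2}
print(best_tag)  # N
```

A form such as alomatlar would be looked up under the feature that matches its stem and suffix pattern, and the tag with the highest relative frequency would be chosen. The classification parameters actually used have to be defined separately for each homonym, as the text notes.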

Conclusion
Another common probabilistic approach is an algorithm based on the Hidden Markov Model (HMM). The main idea of the algorithm is to choose, for each word in the sentence, the grammatical tag that maximizes a function of two conditional probabilities: the first (estimated from the corpus) is the probability of the current tag occurring given the n preceding tags, and the second (calculated from corpus data) relates the tag to the grammatical properties of the word. Although the HMM involves complex computations, various simplifications are used in practice. For English, it distinguishes word meanings with 96% accuracy. Applying this model to Russian may be more difficult than applying it to English, requiring very large corpora given the richness of word formation and inflection in Russian.