Modern Information Technologies / 2. Computer Engineering and Programming

 

Zhumanov Zh.

Al-Farabi Kazakh National University, Almaty, Kazakhstan

 

Model and algorithm for determining the meaning of ambiguous words with a parallel corpus of Kazakh and English languages

 

Introduction

 

This article addresses statistical machine translation for languages that lack linguistic corpora. Statistical machine translation is currently a leading technology for building translation software. However, it requires the analysis of large volumes of pre-processed text data (corpora): a monolingual corpus for each of the languages involved and a joint (parallel) corpus. This article shows how the approach can still be used when such data is unavailable. In that case, the technology is used to improve translation quality by determining the meaning of ambiguous words.

 

Statistical approach to machine translation

 

Statistical machine translation is an approach to machine translation in which translation is based on statistical models. The parameters of these models are estimated by analyzing bilingual parallel corpora. The idea of statistical machine translation was suggested by Warren Weaver in 1949. The approach was revived in 1991 by researchers at one of IBM's research centers. It is currently one of the most studied methods of machine translation. [1]

Let us outline the main points of this approach. P(k) is the probability that the translation software is presented with a sentence k. P(e|k) is the conditional probability that the sentence e in the target language corresponds to the sentence k in the input language.

Assume that the translation software is given a sentence k in Kazakh. The program's objective is to find the English sentence e that maximizes P(e|k). The sentence found is usually called the most likely translation; below it is denoted by e'. With this in mind, we can write:

 

$$e' = \arg\max_{e} P(e \mid k) \qquad (1)$$

 

Using Bayes' formula:

 

$$P(e \mid k) = \frac{P(e)\, P(k \mid e)}{P(k)}, \qquad (2)$$

 

and noting that P(k) does not depend on e, we get:

 

 

$$e' = \arg\max_{e} P(e)\, P(k \mid e) \qquad (3)$$

 

The component P(e) is called the language model. In practice, it is determined by how frequently the sentence e is used in the language and serves to "control" the correctness of the sentence. As a rule, sentences with incorrect grammatical structure or semantics are rarely used in any language. Consequently, the probability of such sentences is very small.

The component P(k|e) is called the translation model. It gives the probability that the sentences k and e, in different languages, correspond to each other.
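How the two components interact in (3) can be illustrated with a toy selection step. The sketch below is only an illustration: the candidate sentences and all probability values are invented for some fixed input sentence k, and the function name best_translation is hypothetical.

```python
# Toy noisy-channel selection: pick the candidate e that maximizes P(e) * P(k|e).
# All numbers are made-up illustrative values, not estimates from a real corpus.

def best_translation(candidates):
    """candidates maps an English sentence e to the pair (P(e), P(k|e))."""
    return max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])

candidates = {
    "I am going home": (0.020, 0.30),   # fluent and a good match to k
    "I home going am": (0.0001, 0.35),  # good word choice, but poor English
    "I am at home": (0.030, 0.05),      # fluent, but a weaker match to k
}

print(best_translation(candidates))
```

The second candidate has the highest translation-model score, yet its very low language-model score removes it from contention, which is the "control" role of P(e) described above.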

Natural languages contain valid sentences that differ only slightly (e.g. in the gender or number of their members), so it is more efficient for software to analyze not complete sentences but the groups of words of which they are composed. This rests on the assumption that a correct sentence consists of correct phrases.

A group of n words is called an n-gram. Accordingly, a group of 1 word is called a unigram, of 2 words a bigram, and of 3 words a trigram. This gives one of the fundamental propositions of the statistical approach: if a sentence consists of valid n-grams, it is likely to be correct.

For the bigram model, we define P(y|x), the conditional bigram probability, as the probability that the word «y» follows the word «x». It is computed as the number of occurrences of the group «xy» divided by the total number of occurrences of the word «x».

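When constructing the language model, the probability P(e) of a sentence e consisting of the words w_1, ..., w_n is defined as follows:

$$P(e) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}), \qquad (4)$$

where w_0 denotes the start of the sentence.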
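The bigram estimate described above can be sketched in a few lines. This is a minimal illustration, not a production language model: the three-sentence English corpus is invented, the start and end markers <s> and </s> are an assumption of the sketch, and no smoothing is applied, so any unseen bigram drives the sentence probability to zero.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count words and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])             # every token that starts a bigram
        bigrams.update(zip(padded, padded[1:]))  # adjacent pairs (x, y)
    return unigrams, bigrams

def bigram_prob(y, x, unigrams, bigrams):
    """P(y|x) = count(xy) / count(x); 0.0 if x was never seen."""
    return bigrams[(x, y)] / unigrams[x] if unigrams[x] else 0.0

def sentence_prob(words, unigrams, bigrams):
    """Product of the bigram probabilities over the padded sentence."""
    padded = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for x, y in zip(padded, padded[1:]):
        prob *= bigram_prob(y, x, unigrams, bigrams)
    return prob

# Invented toy corpus.
corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["the", "cat", "eats"]]
unigrams, bigrams = train_bigrams(corpus)

print(bigram_prob("cat", "the", unigrams, bigrams))              # "the cat" in 2 of 3 uses of "the"
print(sentence_prob(["the", "cat", "sleeps"], unigrams, bigrams))
print(sentence_prob(["the", "dog", "eats"], unigrams, bigrams))  # "dog eats" never occurs
```

The last call shows why corpus size matters for this approach: a perfectly valid sentence receives probability zero as soon as one of its bigrams is missing from the corpus.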

 

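The trigram model has become more widely used: P(z|xy) is the number of occurrences of the group «xyz» divided by the number of occurrences of «xy». For a sentence e consisting of the words w_1, ..., w_n:

$$P(e) = \prod_{i=1}^{n} P(w_i \mid w_{i-2} w_{i-1}) \qquad (5)$$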
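The trigram estimate differs from the bigram one only in that the history is two words long. The sketch below, again on an invented toy corpus and with hypothetical function names, counts two-word histories and trigrams; padding with two <s> markers is an assumption made so that the first words of a sentence also have a full history.

```python
from collections import Counter

def train_trigrams(sentences):
    """Count two-word histories and trigrams over padded, tokenized sentences."""
    histories, trigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for x, y, z in zip(padded, padded[1:], padded[2:]):
            histories[(x, y)] += 1
            trigrams[(x, y, z)] += 1
    return histories, trigrams

def trigram_prob(z, x, y, histories, trigrams):
    """P(z|xy) = count(xyz) / count(xy); 0.0 if the history xy was never seen."""
    seen = histories[(x, y)]
    return trigrams[(x, y, z)] / seen if seen else 0.0

# Invented toy corpus.
corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["the", "cat", "eats"]]
histories, trigrams = train_trigrams(corpus)

print(trigram_prob("sleeps", "the", "cat", histories, trigrams))  # 1 of 2 uses of "the cat"
print(trigram_prob("eats", "the", "dog", histories, trigrams))    # "the dog eats" never occurs
```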

 

The trigram model is also used to determine the translation model P(k|e). A multi-step analysis is carried out on a parallel aligned corpus split into trigrams; at each stage, correspondences between elements of the trigrams in the two languages are successively determined and assigned probabilities. [2]

The main difficulty in applying the statistical approach is the need for a parallel bilingual corpus and, sometimes, a monolingual corpus for each of the languages involved. Creating such corpora is a separate task with its own specifics, well outside the scope of the present problem. When no ready-made corpus-linguistics resources exist for the languages involved, the statistical approach to machine translation becomes inefficient.

One advantage of the statistical approach is that it can handle ambiguous words. As the overview above shows, each ambiguous word is associated not with a literal translation but with the most likely one, determined from the linguistic corpus used. If, instead of a full parallel corpus, we build a corpus from sentences in which ambiguous words are used in different contexts, and modify the mathematical model described above, the resulting model will solve the problem of translating ambiguous words.

Creating a corpus whose sentences contain ambiguous words appears simpler, in the sense that its quality (representativeness, balance of topics, genres, etc.) is easier to control. The requirements on the size of the corpus are also lower.

 

Model for determining the meaning of ambiguous words based on the statistical approach

 

Let x be an ambiguous word in Kazakh, y its English translation (which depends on the context), and P(y|x) the probability that y is the translation of x in this context. As shown in [3], there are 5 types of context: co-text, rel-text, chron-text, bi-text, and non-text. This problem concerns co-text: the words of the sentence directly related to the ambiguous word x. A related word z can precede x or follow it, so two cases must be considered: P(y|zx), where y is a translation of x in the context «zx», and P(y|xz), where y is a translation of x in the context «xz». (Note: x and z are Kazakh words; y is an English word.)

By analogy with (3), we obtain:

 

$$y' = \arg\max_{y} P(y \mid zx) = \arg\max_{y} P(y)\, P(zx \mid y)$$

$$y' = \arg\max_{y} P(y \mid xz) = \arg\max_{y} P(y)\, P(xz \mid y)$$

 

where y' is the required translation of the ambiguous word x.

P(y) reflects how correct the word is in English. Since all the translations of the ambiguous word are known in advance (taken from a dictionary), P(y) is taken to be equal to 1 for every candidate. The choice of y' therefore depends on P(zx|y) or P(xz|y). These values are estimated from the corpus as follows: among the sentence pairs whose English part contains y, count those whose Kazakh part contains the group zx (or xz), and divide by the total number of sentence pairs whose English part contains y. When the context of x is determined by the word preceding it (the case zx), P(xz|y) will be very small. When the context of x is determined by the word following it (the case xz), P(zx|y) will be very small.
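The corpus estimate of P(zx|y) and P(xz|y) can be sketched as follows. The sketch reads these values as relative frequencies: of the sentence pairs whose English part contains y, the share whose Kazakh part contains the adjacent group. That reading, the function names, and the four-pair toy corpus are assumptions made for illustration; both sides are assumed to be tokenized and lemmatized, and the ambiguous word is «жас» with the candidate translations "young" and "tear".

```python
def has_pair(words, first, second):
    """True if `first` is immediately followed by `second` in the token list."""
    return any(a == first and b == second for a, b in zip(words, words[1:]))

def context_prob(pairs, y, first, second):
    """Estimate P(first second | y) from (kazakh_tokens, english_tokens) pairs."""
    with_y = [kazakh for kazakh, english in pairs if y in english]
    if not with_y:
        return 0.0  # y never occurs in the English part of the corpus
    hits = sum(has_pair(kazakh, first, second) for kazakh in with_y)
    return hits / len(with_y)

# Invented, lemmatized toy corpus of aligned sentence pairs.
pairs = [
    (["жас", "бала", "ойнау"], ["young", "child", "play"]),
    (["жас", "жігіт", "келу"], ["young", "man", "come"]),
    (["көз", "жас", "ағу"], ["tear", "flow"]),
    (["жас", "бала", "жылау"], ["young", "child", "cry"]),
]

# Case xz: x = "жас", z = "бала" follows it.
print(context_prob(pairs, "young", "жас", "бала"))  # 2 of the 3 pairs with "young"
print(context_prob(pairs, "tear", "жас", "бала"))   # no pair with "tear" has this group

# Case zx: z = "көз" precedes x = "жас".
print(context_prob(pairs, "tear", "көз", "жас"))
print(context_prob(pairs, "young", "көз", "жас"))
```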

 

Algorithm for determining the meaning of ambiguous words

 

The model described above can be implemented by the following algorithm for determining the meaning of ambiguous words using a parallel corpus of Kazakh and English:

1 Find all possible translations of x using a dictionary.

2 For each translation y, calculate P(zx|y) and P(xz|y) from the corpus.

3 As the translation of x in the given context, take the value y that maximizes the probability calculated in the previous step.

4 If all the calculated probabilities are equal to 0, the corpus is incomplete. In this case, take as the most likely translation the value y that occurs most often in the English part of the corpus together with x.

5 If several translations have the same probability, take the one among them that occurs most often in the corpus together with x, without regard to context.

The last two cases arise when the corpus does not cover all possible uses of x. When handling such exceptions, these events should be recorded so that the corpus can be supplemented later.
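The five steps can be put together in a short sketch. It is an illustration under several assumptions rather than a reference implementation: the dictionary and the lemmatized toy corpus are invented, all function names are hypothetical, the score of a candidate is taken to be the larger of P(zx|y) and P(xz|y), and the returned flag stands in for recording the exceptional cases.

```python
def has_pair(words, first, second):
    """True if `first` is immediately followed by `second` in the token list."""
    return any(a == first and b == second for a, b in zip(words, words[1:]))

def context_prob(pairs, y, first, second):
    """Share of pairs with y on the English side whose Kazakh side has the group."""
    with_y = [kazakh for kazakh, english in pairs if y in english]
    if not with_y:
        return 0.0
    return sum(has_pair(kazakh, first, second) for kazakh in with_y) / len(with_y)

def cooccurrence(pairs, x, y):
    """Number of pairs with x on the Kazakh side and y on the English side."""
    return sum(1 for kazakh, english in pairs if x in kazakh and y in english)

def disambiguate(x, left, right, dictionary, pairs):
    """Return (translation, used_fallback); left or right may be None."""
    candidates = dictionary[x]                                   # step 1
    scores = {}
    for y in candidates:                                         # step 2
        p_zx = context_prob(pairs, y, left, x) if left else 0.0
        p_xz = context_prob(pairs, y, x, right) if right else 0.0
        scores[y] = max(p_zx, p_xz)  # assumption: keep the larger of the two
    best = max(scores.values())                                  # step 3
    tied = [y for y in candidates if scores[y] == best]
    if best > 0 and len(tied) == 1:
        return tied[0], False
    # Steps 4 and 5: all scores are zero (every candidate is tied) or several
    # candidates share the top score; fall back to co-occurrence with x.
    return max(tied, key=lambda y: cooccurrence(pairs, x, y)), True

# Invented dictionary entry and lemmatized toy corpus.
dictionary = {"жас": ["young", "tear"]}
pairs = [
    (["жас", "бала", "ойнау"], ["young", "child", "play"]),
    (["жас", "жігіт", "келу"], ["young", "man", "come"]),
    (["көз", "жас", "ағу"], ["tear", "flow"]),
    (["жас", "бала", "жылау"], ["young", "child", "cry"]),
]

print(disambiguate("жас", None, "бала", dictionary, pairs))   # context seen in the corpus
print(disambiguate("жас", "көз", "ағу", dictionary, pairs))   # context seen in the corpus
print(disambiguate("жас", None, "адам", dictionary, pairs))   # unseen context, fallback
```

When the second element of the result is True, the context was missing from the corpus or did not separate the candidates; a real system could log such calls as candidates for supplementing the corpus.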

 

Conclusion

 

This article describes a model and an algorithm for determining the meaning of ambiguous words using a parallel corpus of Kazakh and English. The advantages of statistical machine translation can thus be used to improve translation quality without large linguistic corpora.

 

References

 

1 Manning, C. D., Schütze, H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.

2 Knight, K. A Statistical MT Tutorial Workbook. JHU summer workshop, April 30, 1999.

3 Melby, A. K., Foster, C. Context in translation: Definition, access and teamwork. The International Journal for Translation & Interpreting Research, Vol. 2, No. 2 (2010).