Philological Sciences / 6. Topical Problems of Translation

 

PhD Blagodarna O.M.

Karazin Kharkiv National University, Ukraine

Recent Approaches to Machine Translation

 

Translation is one of the most complicated tasks of the human mind, requiring not only linguistic knowledge but also extra-linguistic knowledge of the world. Recent progress in the field of Machine Translation (MT) has led developers of Computer-Aided Translation tools (CAT tools) to incorporate the possibility of MT integration into the translation workflow. In order to gain a better insight into the different approaches to MT integration, this abstract analyzes the present-day classification of MT approaches.

MT tools try to retrieve the best matches for the sentences of the text to be translated from a bilingual text archive or database containing sentence-level alignments of existing translations and their original texts. Yet MT does not presuppose any human intervention apart from the initial step of setting up the MT engine and possibly training it for specific purposes. MT aims to create a translation on its own and thereby to increase productivity: its output may or may not be post-edited by a human translator, depending on the quality needs.

As of today, the basic categories of MT systems are Rule-Based MT (RBMT) as opposed to Corpus-Based MT (CBMT), the latter having two significant sub-divisions: Example-Based MT (EBMT) and Statistical MT (SMT) [2].

RBMT produces a translation by drawing on linguistic information about the source and target languages retrieved from compiled dictionaries and grammars that cover the main semantic, morphological, and syntactic regularities of each language. There are three types of RBMT:

1. direct systems: dictionary-based MT that maps input to output with basic rules, more or less word for word (a minimal sketch of this type is given after the list);

2. transfer-based systems: employ morphological and syntactic analysis; they require an intermediate representation that captures the “meaning” of the original sentence and may operate on two levels – syntactic (superficial transfer) or semantic (deep transfer);

3. interlingual systems: translate via an abstract, language-independent representation of meaning (an interlingua).
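To make the first of these types concrete, the minimal Python sketch below shows how a direct, dictionary-based system might map input to output more or less word for word; the toy Spanish-English dictionary and the single reordering rule are invented for illustration and do not come from any real system.

# Minimal sketch of a direct, dictionary-based system (illustrative only);
# the toy Spanish-English dictionary and the noun-adjective reordering rule
# are invented assumptions, not data from any real RBMT product.
TOY_DICTIONARY = {
    "la": "the", "el": "the", "casa": "house", "gato": "cat",
    "blanca": "white", "negro": "black",
}
ADJECTIVES = {"blanca", "negro"}

def direct_translate(sentence):
    words = sentence.lower().split()
    reordered, i = [], 0
    while i < len(words):
        # Crude rule: Spanish adjectives usually follow the noun, so swap them.
        if i + 1 < len(words) and words[i + 1] in ADJECTIVES:
            reordered.extend([words[i + 1], words[i]])
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    # Word-for-word lookup; unknown words are passed through unchanged.
    return " ".join(TOY_DICTIONARY.get(w, w) for w in reordered)

print(direct_translate("la casa blanca"))   # -> "the white house"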

EBMT is an approach first suggested in Japan in the early 1980s [4], while the SMT approach was developed at IBM in the late 1980s [1]. The major difference between them is that SMT treats translation as a “statistical optimization problem” [3, p. 17] based on probability calculations over large bilingual corpora, while EBMT tries to find analogies between an input sentence and examples from a bilingual corpus, applying more “traditional” linguistic means such as (morpho-)syntactic analysis and thesauri.

EBMT is similar to Translation Memory: it reuses existing translations and recombines fuzzy matches, resulting in so-called sub-segment leveraging, but it does not conduct any deep linguistic analysis. The ideal translation unit in EBMT is a sentence. The first term used to describe this principle was “machine translation by the analogy principle”, elaborated in the early 1980s by M. Nagao [4], who focused on having the computer learn the grammatical rules of a language along the lines of how people learn a second language. The core idea was that grammatical rules would emerge from comparing differences between sentences, the computer first being given very short, simple sentences and then progressively longer ones. The experiment was not successful because of the poor speed and memory capacity of the computers of the time. Nevertheless, the conclusion was that grammatical rules could be extracted automatically by simulating the human language learning process. This idea was further elaborated, and from the 1990s the analogy principle became widely known around the world.
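To give a rough idea of how such analogy-based reuse might be retrieved in practice, the sketch below finds the closest stored example for a new input sentence using a simple character-level similarity score; the example base, the language pair and the threshold are invented for illustration, and a real EBMT system would add (morpho-)syntactic analysis and thesauri on top of this.

# Rough sketch of fuzzy-match retrieval over a toy example base (illustrative
# only); the example pairs and the 0.6 threshold are invented assumptions.
from difflib import SequenceMatcher

EXAMPLE_BASE = [
    ("He buys a book on politics.", "Er kauft ein Buch ueber Politik."),
    ("He buys a notebook.",         "Er kauft ein Notizbuch."),
]

def best_match(source, threshold=0.6):
    # Score every stored source sentence against the input and keep the best.
    scored = [(SequenceMatcher(None, source, src).ratio(), src, tgt)
              for src, tgt in EXAMPLE_BASE]
    score, src, tgt = max(scored)
    return (src, tgt, score) if score >= threshold else None

print(best_match("He buys a book on history."))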

SMT became popular in the 1990s and has grown into the most studied MT methodology today. It is crucial to stress that this is a pattern-matching technology rather than a linguistic technology, since translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The translation system learns statistical relationships between two languages from the samples that are fed into it; since it is pattern-driven, the more samples the system sees, the stronger the statistical relationships become. Being data-driven, language-independent and relatively cheap (e.g. Moses), this approach can yield satisfactory output quality through extensive bilingual corpus analysis combined with supervision by experts in the field.
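As an illustration of the kind of relationship such a system learns, the toy sketch below estimates word-level translation probabilities from a two-sentence aligned corpus by simple relative co-occurrence frequency; both the data and the estimation method are invented for this example and are far cruder than the models used in real SMT systems.

# Toy estimate of P(target word | source word) from sentence-aligned data
# (illustrative only; the two-sentence corpus is an invented assumption).
from collections import defaultdict

PARALLEL_CORPUS = [
    ("das haus", "the house"),
    ("das buch", "the book"),
]

cooccurrence = defaultdict(lambda: defaultdict(int))
for src_sent, tgt_sent in PARALLEL_CORPUS:
    for s in src_sent.split():
        for t in tgt_sent.split():
            cooccurrence[s][t] += 1

# Normalise the counts into conditional probabilities per source word.
prob = {s: {t: c / sum(counts.values()) for t, c in counts.items()}
        for s, counts in cooccurrence.items()}

print(prob["das"])   # "the" gets the largest share, since it co-occurs twice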

SMT systems implement a highly developed mathematical theory of probability distribution and probability estimation rooted in the work of Frederick Jelinek at the IBM T.J. Watson Research Center and a seminal paper by Brown et al. [1]. SMT systems learn a translation model from a bilingual parallel corpus and a language model from a monolingual corpus. The original idea assumes an unsupervised approach relying merely on the surface forms of the text, with no further linguistic or human intervention, though numerous modifications and enhancements have followed since.
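Schematically, this division of labour can be written in the noisy-channel form used by Brown et al. [1]: the best translation e* of a foreign sentence f is e* = argmax_e P(e|f) = argmax_e P(f|e) · P(e), where the translation model P(f|e) is estimated from the bilingual parallel corpus and the language model P(e) from monolingual data.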

Thus, instead of building machine translation systems by manually writing translation rules, SMT systems are built by automatically analyzing a corpus of translated texts and learning the rules [3]. SMT has been embraced by the academic and commercial research communities as the new dominant paradigm in machine translation. The methodology has left the research labs and become the basis of successful companies such as Language Weaver and the highly visible Google and Microsoft web translation services. Even traditional rule-based companies such as Systran have embraced statistical methods and integrated them into their systems.

CBMT systems aim to produce translations by automatically selecting suitable fragments from the source-language side of the retrieved translation units (TUs) and building the translation from the corresponding elements of the target-language side. Due to the complexity of this recombination task, not every TU contained in a translation archive is equally suited for reuse in translation memory systems and CBMT environments [5].

Keeping that in mind, the newest approaches offer hybrid systems that combine the best of both worlds. This can be achieved in a number of ways (a simplified sketch follows the list):

• Statistics guided by rules: rules are used to pre-process data in an attempt to better guide the statistical engine. Rules are also used to post-process the statistical output to perform functions such as normalization;

• Rules post-processed by statistics: translations are performed using a rule-based engine. Statistics are then used in an attempt to adjust/correct the output from the rules engine.
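A highly simplified sketch of these two strategies, with an invented normalization rule and a placeholder in place of a real statistical engine, might look as follows:

# Simplified sketch of rule/statistics hybridization (illustrative only).
# `statistical_engine` is a stand-in, and the normalization and casing rules
# are invented assumptions rather than features of any real hybrid system.
import re

def rule_preprocess(text):
    # Rules guiding the statistics: normalize date formats before decoding.
    return re.sub(r"(\d{1,2})/(\d{1,2})/(\d{4})", r"\3-\2-\1", text)

def statistical_engine(text):
    # Placeholder for a statistical decoder such as Moses (not implemented).
    return text.lower()

def rule_postprocess(text):
    # Rules correcting the statistical output: restore sentence-initial casing.
    return text[:1].upper() + text[1:] if text else text

source = "The meeting is on 31/12/2024."
print(rule_postprocess(statistical_engine(rule_preprocess(source))))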

 

Bibliography:

1. Brown P., Cocke J., Della Pietra S., Della Pietra V., Jelinek F., Lafferty J., Mercer R., Roossin P. (1988). A statistical approach to language translation. In Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), Budapest, August 1988, 71-76.

2. Carl M., Way A. (eds.) (2003). Recent Advances in Example-Based Machine Translation. Berlin, Heidelberg: Springer.

3. Koehn P. (2010). Statistical Machine Translation. Cambridge: Cambridge University Press.

4. Nagao M. (1984). A framework of a mechanical translation between Japanese and English by analogy principle. In Artificial and Human Intelligence: Edited Review Papers Presented at the International NATO Symposium on Artificial and Human Intelligence, Lyon, 1981, ed. Alick Elithorn and Ranan Banerji, 173-180. Amsterdam, New York, Oxford: North Holland.

5. Reinke U. (2013). State of the Art in Translation Memory. Translation: Computation, Corpora, Cognition. Special Issue on Language Technologies for a Multilingual Europe, ed. by Georg Rehm, Felix Sasaki, Daniel Stein, and Andreas Witt. Vol. 3(1).