Philological Sciences / 6. Topical Problems of Translation
PhD Blagodarna O.M.
Karazin Kharkiv National University, Ukraine
Translation is one of the most complicated tasks of the human mind, requiring not only linguistic knowledge but also extra-linguistic knowledge of the world. Recent progress in the field of Machine Translation (MT) has led developers of Computer Aided Translation tools (CAT-tools) to incorporate the possibility of MT integration into the translation workflow. In order to gain a better insight into the different approaches to MT integration, this abstract analyzes the present-day classification of MT approaches.
Translation Memory (TM) tools try to retrieve the best matches for the sentences of the text to be translated from a bilingual text archive or database containing sentence-level alignments of existing translations and their original texts. MT, by contrast, does not presuppose any human intervention apart from the initial step of setting up the MT engine and possibly training this engine for specific purposes. MT aims to create a translation on its own and thus to further increase productivity by providing automatic translations, which may or may not be post-edited by a human translator depending on the quality needs.
As of today, the basic categories of MT systems are Rule-Based MT (RBMT) and Corpus-Based MT (CBMT), the latter of which has two significant subdivisions: Example-Based MT (EBMT) and Statistical MT (SMT) [2].
RBMT provides a translation by drawing on linguistic information about the source and target languages retrieved from compiled dictionaries and grammars covering the main semantic, morphological, and syntactic regularities of each language. There are three types of RBMT:
1. direct systems: dictionary-based MT that maps input to output with basic rules, more or less word for word (a minimal sketch of this approach is given after the list);
2. transfer-based systems: employ morphological and syntactic analysis; they require an intermediate representation that captures the "meaning" of the original sentence and may operate on two levels, syntactic (superficial transfer) or semantic (deep transfer);
3. interlingual systems: use an abstract, language-independent representation of meaning (an interlingua).
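To make the first, simplest type more concrete, the following minimal Python sketch translates word for word with a bilingual dictionary lookup. The tiny English-German word list and the language pair are invented purely for illustration; a real direct system would also apply morphological rules and local reordering.

    # A toy direct (dictionary-based) MT system: translate more or less word for word.
    # The tiny English-German dictionary below is invented purely for illustration.
    BILINGUAL_DICTIONARY = {
        "the": "die",
        "cat": "Katze",
        "drinks": "trinkt",
        "milk": "Milch",
    }

    def translate_direct(sentence: str) -> str:
        """Map each source word to a target word; keep unknown words unchanged."""
        target_words = []
        for word in sentence.lower().split():
            target_words.append(BILINGUAL_DICTIONARY.get(word, word))
        return " ".join(target_words)

    if __name__ == "__main__":
        print(translate_direct("the cat drinks milk"))  # -> "die Katze trinkt Milch"

Even this toy example shows the main limitation of direct systems: without an intermediate representation, word order and agreement in the target language are left to chance.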
EBMT is an approach that was first
suggested in Japan in the early 1980s [4],
while the SMT approach was developed at IBM in the late 1980s [1]. The major difference between EBMT and SMT is
that SMT considers translation as a “statistical optimization problem” [3, p. 17] and is based on probability calculations over
large bilingual corpora, while EBMT tries to find analogies between an input
sentence and examples from a bilingual corpus applying more “traditional”
linguistic means like (morpho-)syntactic analysis and thesauri.
EBMT is similar to Translation Memory in that it reuses existing translations and recombines fuzzy matches, resulting in so-called sub-leveraging, but it does not conduct any deep linguistic analysis. The ideal translation unit in EBMT is a sentence. The first term used to describe this principle was "machine translation by analogy", elaborated in the early 1980s by M. Nagao [4] and focused on the process of computer learning of the grammatical rules of a language, along the lines of how people learn a second language. The core idea was that grammatical rules would emerge from comparing differences between sentences: the computer was first given very short, simple sentences, and then progressively longer ones. The experiment was not successful because of the limited speed and memory capacity of the computers of the time. Nevertheless, the conclusion was that grammatical rules could be extracted automatically by simulating the human language learning process. This principle was further elaborated, and from the 1990s the analogy approach became widely known around the world.
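The retrieval step of the analogy principle can be sketched as follows: given an input sentence, the system finds the most similar example in a bilingual archive and reuses its translation. The example pairs and the similarity measure (simple word overlap) below are assumptions made for illustration only; real EBMT systems use richer (morpho-)syntactic matching and recombine sub-sentential fragments.

    # A minimal sketch of example-based retrieval: find the closest stored example
    # for an input sentence by word overlap and reuse its translation.
    # The example pairs and the similarity measure are illustrative assumptions.
    EXAMPLE_ARCHIVE = [
        ("the weather is nice today", "das Wetter ist heute schön"),
        ("the weather is bad", "das Wetter ist schlecht"),
        ("he reads a book", "er liest ein Buch"),
    ]

    def similarity(a: str, b: str) -> float:
        """Fuzzy-match score: proportion of shared words (Jaccard coefficient)."""
        set_a, set_b = set(a.lower().split()), set(b.lower().split())
        return len(set_a & set_b) / len(set_a | set_b)

    def retrieve_best_example(sentence: str):
        """Return the archived source/target pair most similar to the input."""
        return max(EXAMPLE_ARCHIVE, key=lambda pair: similarity(sentence, pair[0]))

    if __name__ == "__main__":
        source, target = retrieve_best_example("the weather is nice")
        print(source, "->", target)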
SMT became popular in the 1990s and has grown to be the most studied MT methodology today. It is crucial to stress that this is a pattern-matching technology rather than a linguistic technology, since the translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The translation system learns statistical relationships between two languages based on the samples that are fed into it. Since the approach is pattern-driven, the more samples the system sees, the stronger the statistical relationships become. Being data-driven, language-independent and relatively cheap (e.g. Moses), this approach can yield satisfactory output quality by means of extensive bilingual corpus analysis in combination with supervision by experts in the field.
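The pattern-driven character of SMT can be illustrated with relative-frequency estimation: translation probabilities are derived simply by counting how often words co-occur in the aligned samples fed into the system, so more samples mean sharper estimates. The toy corpus and the naive position-based alignment in the sketch below are assumptions for illustration only.

    # A toy illustration of how SMT derives model parameters from data:
    # estimate P(target_word | source_word) by relative frequency over
    # (naively) word-aligned sentence pairs. The corpus is invented for illustration.
    from collections import Counter, defaultdict

    ALIGNED_CORPUS = [
        ("the house", "das Haus"),
        ("the car", "das Auto"),
        ("a house", "ein Haus"),
    ]

    def estimate_translation_probs(corpus):
        """Count co-occurrences of position-aligned words and normalise per source word."""
        counts = defaultdict(Counter)
        for source, target in corpus:
            for s_word, t_word in zip(source.split(), target.split()):
                counts[s_word][t_word] += 1
        probs = {}
        for s_word, t_counts in counts.items():
            total = sum(t_counts.values())
            probs[s_word] = {t: c / total for t, c in t_counts.items()}
        return probs

    if __name__ == "__main__":
        print(estimate_translation_probs(ALIGNED_CORPUS)["house"])  # {'Haus': 1.0}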
SMT systems implement a highly developed mathematical theory of probability distribution and probability estimation, rooted in the work of Frederick Jelinek at the IBM T.J. Watson Research Center and a seminal paper by Brown et al. [1]. SMT systems learn a translation model from a bilingual parallel corpus and a language model from a monolingual corpus. The original idea assumes an unsupervised approach relying merely on the surface forms of the text, with no further linguistic or human intervention, though there have been a number of modifications and enhancements since.
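The division of labour between the two models is commonly described with the noisy-channel formulation [3]: candidate translations e of a source sentence f are ranked by P(f|e) * P(e), where the translation model P(f|e) comes from the bilingual corpus and the language model P(e) from the monolingual corpus. The probability tables in the sketch below are invented values standing in for trained models.

    # A minimal sketch of noisy-channel scoring in SMT: rank candidate translations e
    # of a source sentence f by P(f | e) * P(e). The two toy probability tables stand
    # in for a translation model and a language model; their values are invented.
    SOURCE_SENTENCE = "das haus"

    # Translation model P(f | e), learned in practice from a bilingual parallel corpus.
    TRANSLATION_MODEL = {
        ("das haus", "the house"): 0.7,
        ("das haus", "the home"): 0.3,
    }

    # Language model P(e), learned in practice from a large monolingual corpus.
    LANGUAGE_MODEL = {
        "the house": 0.02,
        "the home": 0.005,
    }

    def score(candidate: str, source: str) -> float:
        """Noisy-channel score: translation-model probability times language-model probability."""
        return TRANSLATION_MODEL.get((source, candidate), 0.0) * LANGUAGE_MODEL.get(candidate, 0.0)

    if __name__ == "__main__":
        candidates = ["the house", "the home"]
        best = max(candidates, key=lambda e: score(e, SOURCE_SENTENCE))
        print(best)  # "the house"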
Thus, instead
of building machine translation systems by manually writing translation rules,
SMT systems are built by automatically analyzing a corpus of translated texts
and learning the rules [3]. SMT has been embraced by the
academic and commercial research communities as the new dominant paradigm in
machine translation. The methodology has left the research labs and become the
basis of successful companies such as Language Weaver and the highly visible
Google and Microsoft web translation services. Even traditional rule-based
companies such as Systran have embraced statistical methods and integrated them
into their systems.
CBMT systems aim at producing translations by automatically selecting suitable fragments from the source language side of the retrieved translation units (TUs) and building the translation from the corresponding elements of the target language side. Due to the complexity of this recombination task, not every TU contained in a translation archive is equally suited for reuse in translation memory systems and CBMT environments [5].
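The recombination task can be sketched as covering the input sentence with source-side fragments of stored TUs and assembling the output from their target-side counterparts. The TUs and the greedy substitution strategy below are simplified assumptions; real systems must additionally resolve overlaps, reordering and agreement, which is precisely why not every TU is equally reusable.

    # A toy sketch of CBMT-style recombination: cover the input sentence with
    # source-side fragments of stored translation units (TUs) and assemble the
    # translation from their target-side counterparts. The TUs are invented examples.
    TRANSLATION_UNITS = {
        "the old house": "das alte Haus",
        "is for sale": "steht zum Verkauf",
    }

    def recombine(sentence: str) -> str:
        """Greedily replace known source fragments with their target fragments."""
        output = sentence
        for source_fragment, target_fragment in TRANSLATION_UNITS.items():
            output = output.replace(source_fragment, target_fragment)
        return output

    if __name__ == "__main__":
        print(recombine("the old house is for sale"))
        # -> "das alte Haus steht zum Verkauf"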
Keeping that in mind, the newest
approach offers hybrid systems, combining the best of both worlds. This can be
achieved in a number of ways:
• Statistics guided by rules: rules are used to pre-process data in an attempt to better guide the statistical engine; rules are also used to post-process the statistical output to perform functions such as normalization (a minimal sketch of this strategy is given after the list);
• Rules post-processed by
statistics: translations are performed using a rule-based engine. Statistics
are then used in an attempt to adjust/correct the output from the rules engine.
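A hedged sketch of the first strategy is given below: a rule normalises date formats in the input before a statistical engine is called, and another rule normalises whitespace in the engine output. The function statistical_engine is a hypothetical placeholder, not a real API, and the normalisation rules are illustrative assumptions.

    # A minimal sketch of a hybrid pipeline: rule-based pre- and post-processing
    # around a statistical engine. `statistical_engine` is a hypothetical placeholder
    # standing in for any SMT back end; the normalisation rules are illustrative.
    import re

    def preprocess(text: str) -> str:
        """Rule: normalise date formats (e.g. 31/12/2015 -> 2015-12-31) before translation."""
        return re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\2-\1", text)

    def statistical_engine(text: str) -> str:
        """Placeholder for a statistical MT engine (identity stand-in for illustration)."""
        return text

    def postprocess(text: str) -> str:
        """Rule: collapse repeated whitespace in the engine output."""
        return re.sub(r"\s+", " ", text).strip()

    def hybrid_translate(text: str) -> str:
        return postprocess(statistical_engine(preprocess(text)))

    if __name__ == "__main__":
        print(hybrid_translate("Delivered on 31/12/2015  to the  customer"))
        # -> "Delivered on 2015-12-31 to the customer"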
Bibliography:
1. Brown P., Cocke J., Della Pietra S., Della Pietra V., Jelinek F., Lafferty J., Mercer R., Roossin P. (1988). A statistical approach to language translation. In Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), Budapest, August 1988, 71-76.
2. Carl M., Way A. (eds.) (2003). Recent Advances in Example-Based Machine Translation: From Real Users to Research. Springer Berlin Heidelberg.
3. Koehn P. (2010). Statistical Machine Translation. Cambridge: Cambridge University Press.
4. Nagao M. (1984). A framework of a mechanical translation between Japanese and English by analogy principle. In Artificial and Human Intelligence. Edited review papers presented at the International NATO Symposium on Artificial and Human Intelligence, Lyon, 1981, ed. Alick Elithorn and Ranan Banerji, 173-180. Amsterdam, New York, Oxford: North-Holland.
5. Reinke U. (2013). State of the Art in Translation Memory. Translation: Computation, Corpora, Cognition. Special Issue on Language Technologies for a Multilingual Europe, ed. by Georg Rehm, Felix Sasaki, Daniel Stein, and Andreas Witt. Vol. 3(1).