Philological Sciences / 6. Topical Problems of Translation
Student Murzina E.R.
Kazan (Volga Region) Federal University, Russia
Cand. Sc. (Pedagogy) Baranova A.R.
Kazan (Volga Region) Federal University, Russia
Natural Language Processing
Abstract: This article deals with a range of issues in the development of systems that use Natural Language Processing methods. The linguistic structure of natural language and the problems that a computer encounters when processing natural language are analyzed. Examples of systems that use automatic processing of human language are given.
Key words: language, translation, text, minority, speech, semantics, Natural Language Processing, mathematical sciences, information technologies, mathematical linguistics.
With the development of computer and innovative technologies, such scientific areas as artificial intelligence and mathematical linguistics have emerged. Mathematical linguistics is an area of mathematical and computer modeling of human intellectual activity that is used in the creation of artificial intelligence systems. Artificial intelligence uses mathematical models to describe natural human languages. Together they are focused on solving the problem of creating a more convenient form of human-computer interaction. This means that if people build a natural-language interface for computers that can recognize and process human language, anyone will be able to interact with a computer as with a human being.
The earliest program in history to understand natural language was SHRDLU. It was developed in 1968–1970 by Terry Winograd at MIT. SHRDLU was written in the Micro Planner and Lisp programming languages on the DEC PDP-6 computer and used a DEC graphics terminal. Later, at the computer graphics lab at the University of Utah, a complete three-dimensional rendering of SHRDLU's "world" was introduced. A user communicated with SHRDLU in ordinary English expressions. When SHRDLU received a command, it moved simple objects in a simplified "world of blocks", namely cubes, cones, balls and other geometric shapes. For example, the user could ask the program to "put the red ball on the blue block" and then to "remove the cone". The entire set of objects and possible actions included no more than 50 different words: nouns, verbs and adjectives. The possible combinations of these basic words were quite simple, and the program was adept at figuring out what the user meant. SHRDLU also included a simple memory and could recall the actions that had been performed earlier [7].
SHRDLU is considered an extremely successful demonstration of artificial intelligence. It inspired new developments connected with the processing of human language by artificial intelligence. However, later systems attempted to deal with realistic situations involving the ambiguity and complexity of the real world, and showed that, in practice, it was not so easy to achieve the set goals. The work in this direction was continued by programs that operated on ever larger amounts of information.
Douglas Lenat at the Microelectronics and Computer Technology Corporation started the Cyc project in 1984. It is an example of a large-scale knowledge base that allows programs to solve complex artificial intelligence tasks on the basis of inference and common sense. The system is capable of building logical chains and then responding to a question. A typical example of knowledge represented in the database is "Mammals are a class of vertebrates, the main distinguishing feature of which is feeding their young with milk". So, if asked "Do dolphins feed their young with milk?", the engine can make the obvious inference that dolphins are mammals and, consequently, answer the question correctly. Cyc contains over one million assertions, rules and common-sense ideas fed in by people [4].
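This kind of reasoning can be illustrated with a minimal sketch. The Python fragment below is not Cyc's actual CycL engine; it only shows, using invented facts, how a class hierarchy ("dolphins are mammals") plus a property attached to a class ("mammals feed their young with milk") lets a program answer a question it was never told explicitly.

```python
# A hypothetical, minimal knowledge base: class membership plus inherited
# properties. Not Cyc's real representation, just an illustration.
IS_A = {
    "dolphin": "mammal",     # dolphins are mammals
    "mammal": "vertebrate",  # mammals are a class of vertebrates
}

# Properties attached to a class hold for everything that belongs to it.
PROPERTIES = {
    "mammal": {"feeds its young with milk"},
}

def classes_of(entity):
    """Walk the is-a chain and collect every class the entity belongs to."""
    result, current = [], entity
    while current in IS_A:
        current = IS_A[current]
        result.append(current)
    return result

def has_property(entity, prop):
    """An entity has a property if any of its classes carries it."""
    return any(prop in PROPERTIES.get(cls, set()) for cls in classes_of(entity))

# "Do dolphins feed their young with milk?" -> True, by inference
print(has_property("dolphin", "feeds its young with milk"))
```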
IBM Watson is a supercomputer from IBM equipped with an artificial intelligence question-answering system. It was created by a group of researchers under the supervision of David Ferrucci as part of the DeepQA project. The main task for Watson is to understand questions stated in natural language and to find answers to them in its database. Watson was named after IBM founder Thomas Watson. It comprises a cluster of ninety IBM Power 750 servers, each of which uses an eight-core POWER7 processor. The total amount of RAM is more than 15 terabytes. The system had access to 200 million pages of structured and unstructured data with a volume of 4 terabytes, including the full text of Wikipedia. In February 2011, the supercomputer took part in the Jeopardy! television show, won, and received one million dollars. Watson did not have access to the Internet during the game [5]. Currently, IBM makes Watson's intelligence available to stimulate innovations and new applications.
The method of machine learning is used to speed up the process of expanding a system's knowledge base. Machine learning is a mathematical discipline that draws on mathematical statistics, numerical optimization methods, probability theory and discrete analysis to extract knowledge from data. There is some relationship between objects and answers, but it is unknown; only a finite set of "object-response" pairs is known. Based on these data, the dependence has to be reconstructed, that is, an algorithm has to be built that can give a sufficiently accurate answer for any object. The accuracy depends on the number of topics the system needs to classify: the more topics, the lower the accuracy, i.e. the harder it is to construct the algorithm. In classical mathematical problems, objects are real numbers or vectors. In real applications, input data about objects can be incomplete, inaccurate and non-numeric, which makes the task more difficult to solve.
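A minimal supervised-learning sketch of this "object-response" setup is shown below. It assumes the scikit-learn library is available; the tiny texts and topic labels are invented purely for illustration and do not describe any particular system mentioned in this article.

```python
# Learning from "object-response" pairs: texts are objects, topics are responses.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The known, finite set of object-response pairs (an invented toy dataset).
texts = [
    "the match ended with a late goal",
    "the team won the championship final",
    "the parliament passed a new budget law",
    "the president signed the trade agreement",
]
labels = ["sports", "sports", "politics", "politics"]

# Reconstruct the dependence: fit an algorithm on word counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Give an answer for a previously unseen object.
print(model.predict(["the team scored in the final minute"]))  # -> ['sports']
```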
Statistical methods, which are based on counting the number of words in texts, are most often used to process natural human language. As a rule, they are quite simple and do not require deep theoretical knowledge of linguistics, although they sometimes require profound mathematical knowledge. The origins of these methods lie in mathematics and computational geometry. A text is represented by a vector; the length of the vector is the number of words. All of this mathematics is hidden inside algorithms, and the algorithms are inside libraries, so those who use the libraries do not need to understand the mathematics. However, using this method alone, we will not be able to obtain an answer that satisfies all the necessary logically correct conclusions.
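A minimal sketch of this word-counting representation, in plain Python, is given below; the two example sentences and the vocabulary built from them are invented for illustration.

```python
# A text becomes a vector of word counts over a fixed, shared vocabulary.
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the count vector of `text` over the shared vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

texts = ["the cat sat on the mat", "the dog chased the cat"]

# The shared vocabulary fixes the length of every vector.
vocabulary = sorted({word for t in texts for word in t.lower().split()})
# ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']

for t in texts:
    print(bag_of_words(t, vocabulary))
# [1, 0, 0, 1, 1, 1, 2]
# [1, 1, 1, 0, 0, 0, 2]
```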
Recognizing and understanding natural language requires the system to have tremendous knowledge about the world around us and the ability to interact with it. The very definition of the meaning of the word "to understand" is one of the main tasks of artificial intelligence. The quality of understanding depends on many factors, namely the language, the national culture of a person, the speed and intonation of pronunciation, the accent, the word order in a sentence, and so on. The greatest opportunities and the highest quality of text analysis can be obtained by conducting a full analysis of the text. However, computer systems face many difficulties in processing human language [1].
In our speech, we often use different stylistic figures to enhance the expressiveness of utterances; they are intended more for vivid imagination than for formal computer processing. Syntactic means play an important, albeit secondary, role in the interpretation of a text. But a stylistic analysis of a text performed by a computer has to lead to the synthesis of the text and cannot be reduced only to the recognition of stylistic figures. The most common of these figures are metaphor, metonymy, irony, hyperbole, personification, pun, epithet, comparison and oxymoron. For example, the utterance "Sometime too hot the eye of heaven shines" (W. Shakespeare) [8] contains the metaphor "the eye of heaven" as a name for the Sun. We draw such a conclusion relying on our experience of the cognition of reality and on logical thinking. An epithet emphasizes some basic property of the object it defines: fair sun, the sable night, wide sea. When describing a person's character or appearance, an epithet characterizes an object in a certain artistic way and reveals its features, for example: a sharp smile, a penetrating look, silvery laugh, a shadow of a smile. A person has no problem understanding these adjectives, whereas a machine may have difficulties when processing such information. The comparison in the utterance "His mind was restless, but it worked perversely and thoughts jerked through his brain like the misfiring of a defective carburetor" (S. Maugham) [9] expresses the intensity of the moment experienced by the character. The analysis of such texts can be a serious problem not only for electronic computers but also for humans, since most such figures rest on some real or fictional situation, and inserting them into a text is in effect a reference to that situation.
Natural language processing allows machines to read, listen to and understand a person. The complexity of the problem is that, as a rule, computers "force" people to speak to them using a special language. It has to be unambiguous and well structured, and all the rules of the language have to be strictly obeyed. The linguistic structure of a natural language is far more complicated, and it has a set of constantly changing characteristics. The context of a conversation determines the meaning of a particular phrase, which means that a part of a verbal construction cannot be interpreted unequivocally on its own [6].
For full-fledged work, a text analysis system should be able to analyze the text fed in by a user. The sentence structure is analyzed in terms of syntax; the concepts used in the text are analyzed in terms of semantics; the correctness of the usage of concepts and the purposes of their usage are analyzed in terms of pragmatics. Then the system has to generate its own response in an internal representation suitable for logical inference and has to output the response in natural language [2, p. 106]. The most difficult part of this is to understand the subtext. The human brain easily copes with recognizing the implicit meaning of an utterance at an intuitive level, but computer systems are not yet capable of this.
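The first two levels of such analysis can be sketched in a few lines. The fragment below assumes the spaCy library and its small English model (en_core_web_sm) are installed; pragmatic analysis and response generation are not shown, since they go far beyond a short example.

```python
# Syntactic and (shallow) semantic analysis of a user's sentence with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Put the red ball on the blue block.")

# Syntax: part of speech and dependency role of each word.
for token in doc:
    print(token.text, token.pos_, token.dep_, "<-", token.head.text)

# Semantics: the noun phrases the request refers to.
print([chunk.text for chunk in doc.noun_chunks])  # e.g. ['the red ball', 'the blue block']
```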
However, there are natural language processing systems that we use every day. Currently, we easily talk to Siri on the phone. We can find out the weather simply by saying: "Siri, tell me the weather forecast for tomorrow". We communicate with the browser when we search for information on Google, or with our own car when we want to change the radio station or call someone.
Recently Facebook announced the launch of Trending, a new feature that allows users to see not only the most popular topics, but also those most interesting to them personally. Facebook has done a great deal of work on natural language processing and has learned to "understand" its own users. It automatically processes posts, infers what was meant, and identifies people, things, places and events. Any "event" on the network, whether it is a post by a single person or a mass trend, becomes a puzzle to be solved. Processing all available texts makes it possible to understand what really attracts a user.
There are services that analyze posts and distinguish between positive and negative reviews about brands on the Internet. Such systems are necessary for businesses to track attention to their brands and sales trends [3].
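A lexicon-based sketch of such review analysis is shown below; the word lists are invented for illustration, and real services rely on far richer models.

```python
# Classify a review by counting positive and negative words from small lexicons.
POSITIVE = {"great", "excellent", "love", "reliable", "fast"}
NEGATIVE = {"terrible", "broken", "slow", "hate", "refund"}

def review_sentiment(text):
    """Return 'positive', 'negative' or 'neutral' for a short review."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(review_sentiment("Fast delivery and excellent support, love this brand"))   # positive
print(review_sentiment("The device arrived broken and the support is terrible"))  # negative
```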
In addition, a system for the stock exchange is being developed that is expected to function better than an ideal broker. It will analyze and synthesize political and economic news and give forecasts. Given the huge amount of information and news in the modern world, this is very promising.
"To improve service, this call may be recorded". Have you ever heard such a phrase when you dial a call center? The recorded data are used to improve algorithms for the automatic recognition of natural language. The data obtained while processing the recordings help to create statistical models of how and what people say. As a result, it will be possible to increase the number of automated services and replace an operator with a robot without losing service quality.
A breakthrough in computer processing of natural human language would be an innovative technology that will affect every sphere of human life and open up new functional possibilities for computer systems, making their use easier and more convenient. Large corporations invest in the development of data processing methods and systems, because this will bring their work to a new level and allow them to solve more complex tasks. Computer systems of a new level will appear in the near future. These computers will have artificial intelligence that will change our understanding of machines' capabilities.
References:
1. Baranova A.R., Ladner R.A. Possibilities of machine translation from English into Russian // Tendencies of development of science and education. Part 2. Smolensk, 2016. – Pp. 46-49.
2. Bolshakova Ye.I., Klyshinskiy E.C., Lande D.V., Noskov A.A., Peskova O.V., Yagunova Ye.V. Automatic processing of texts in a natural language and computational linguistics. Moscow, 2011. – 272 p.
3. Burganova A.R., Baranova A.R. Linguo-stylistic specificity of innovative forms of headlines on the Internet // Information technologies in the research space of difference-structured languages: a collection of articles of the I International Internet Conference of young scientists (December 5, 2016). Kazan: Publishing house of Kazan university, 2017. – Pp. 15-17.
4. Cyc [electronic resource] // URL: https://ru.wikipedia.org/wiki/Cyc (date of access: 26.02.2017).
5. IBM Watson [electronic resource] // URL: https://www.ibm.com/watson/ (date of access: 27.02.2017).
6. Murzina A.R., Murzina E.R., Baranova A.R. Impact of computer technologies on language development // Information technologies in the research space of difference-structured languages: a collection of articles of the I International Internet Conference of young scientists (December 5, 2016). Kazan: Publishing house of Kazan university, 2017. – Pp. 5-7.
7. SHRDLU [electronic resource] // URL: https://en.wikipedia.org/wiki/SHRDLU (date of access: 26.02.2017).
8. Shakespeare W. Sonnet XVIII [electronic resource] // URL: http://www.shakespeares-sonnets.com/sonnet/18 (date of access: 29.02.2017).
9. Maugham W.S. The Narrow Corner. Vintage International, 2009. – P. 17.