Philological sciences / 7. Language, speech, speech communication

Cand. Sci. (Philology) Nevreva M.N., Duvanskaya I.F., Kudinova T.I.

Odessa National Polytechnic University

PROBABILISTIC-STATISTICAL MODEL AND THE STAGES OF ITS CREATION (ON THE MATERIAL OF THE ENGLISH TEXT CORPUS “CHEMICAL ENGINEERING”)

In every field of human activity, linguistic means are constantly selected to carry out information and communication tasks, and this, in turn, causes the national language to differentiate into a number of functional styles: scientific, fiction, journalistic, and others.

In studying the functional language features of any of these styles it is expedient to single out separate sections from the general discourse and to explore the totality of the language resources combined by the subjects of the given communication area. The totality of units at all levels of the language system structure constitutes a model of a particular subject area.

The most effective method for selecting lexical material for linguistic research is to extract the object of analysis from text corpora of the studied domain, i.e. from the general totality, which is currently represented by special electronic text corpora of national languages such as the British National Corpus, the American National Corpus, or the Cambridge International Corpus.

This technique, which L. V. Scherba [4] adhered to, requires surveying a large text sample, on the basis of which researchers can obtain valid and reliable results.

To create an adequate probabilistic-statistical vocabulary model of a sublanguage, the corpus must meet the following requirements:

1. Strict attribution of texts to a specific field of communication.

2. Chronological limitations of the text sample material.

3. Completeness of texts regardless of their length in word usages.

When models are built, the sample is quite often formed not from complete texts but from extracts containing a certain number of word forms, e.g. 1,000 to 10,000 units [1, p. 62]. With this approach, however, much of the significant information on the composition of the sublanguage lexicon and the stratification of lexical groups is lost.

4. Sufficient size of the sample from the electronic text corpus to obtain statistically reliable source material.

To determine the object of research, the method of overall survey is generally used. Because each subject area is quite complex, it is usually inaccessible to direct observation and copying. Therefore its analog (model) is compiled, i.e. the semantic space, which is a certain field of discourse reflecting a portion of objective reality.

The formation of the semantic space is the first stage in creating a probabilistic-statistical model (frequency dictionary) of a particular area of communication. This stage involves determining the nomenclature of sub-themes of the semantic space and their shares. It is followed by quantitative and statistical calculations that identify the units of the prospective frequency list.

The present work is devoted to the lexicographic study of the scientific functional style. The object considered is the electronic text corpus of the technical specialty "Chemical Mechanical Engineering", which exhibits the probabilistic and statistical features typical of this style. The process of creating its model is described as well.

The novelty of the work lies in the fact that such a model has been compiled for the first time: the available literature mentions no study in which exactly this branch of engineering serves as the research material.

The textual materials that formed the basis of objective data on the technical specialty "Chemical Mechanical Engineering" were drawn from English and American magazines covering a ten-year period: Chemical Engineering, Chemical Processing, Chemical Engineering Progress, etc. These magazines are part of the national corpora, the British National Corpus and the American National Corpus, which contain not only spoken-language subcorpora but also sets of texts relating to scientific communication.

An overall survey of the selected part of the electronic text corpus "Chemical Mechanical Engineering" was carried out. The units considered included all autosemantic and syntactic words, numerals written in words, and conventional abbreviations. Proper names, mathematical symbols, formulae, and foreign insertions were not taken into consideration.

The semantic space that models the semantics of this field of scientific communication was formed on the basis of expert review by specialists in the technical field "Chemical Mechanical Engineering".

The sub-themes of the semantic space of the sublanguage "Chemical Mechanical Engineering" and their percentage shares are given below:

1. Processes and apparatus of chemical technology – 30%

2. Chemical engineering machinery design – 35%

3. Corrosion of chemical equipment – 25%

4. General chemical technology – 10%
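If the percentages above are read as shares of the text sample (an interpretation, since the paper does not say how the shares were applied), the size of each sub-theme can be estimated as in the following minimal sketch. The function name `allocate_sample` is hypothetical, and the total of 200,000 word usages is the conditional sample size taken from the worked example later in the paper.

```python
# Sub-theme shares as listed above (percent of the semantic space).
SUBTHEME_SHARES = {
    "Processes and apparatus of chemical technology": 30,
    "Chemical engineering machinery design": 35,
    "Corrosion of chemical equipment": 25,
    "General chemical technology": 10,
}

def allocate_sample(total_word_usages, shares):
    """Split a sample size across sub-themes in proportion to their shares."""
    if sum(shares.values()) != 100:
        raise ValueError("sub-theme shares must sum to 100%")
    # Floor division: exact whenever total_word_usages * pct is divisible by 100.
    return {theme: total_word_usages * pct // 100 for theme, pct in shares.items()}

# Assumption: the conditional sample size of 200,000 word usages.
quotas = allocate_sample(200_000, SUBTHEME_SHARES)
for theme, size in quotas.items():
    print(f"{theme}: {size} word usages")
```

With these assumptions the four sub-themes receive 60,000, 70,000, 50,000 and 20,000 word usages respectively.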

The next stage is the creation of the actual probabilistic-statistical model, i.e. the frequency dictionary of the "Chemical Mechanical Engineering" domain. It consists of several steps:

1. Compiling the alphabetical ranking list of all word forms of the text.

2. Drawing up a frequency list in which all word forms are arranged in descending order of frequency.

3. Reducing all word forms of the obtained frequency list to the main vocabulary units.

4. Calculating the statistical parameters of each word and identifying some common linguistic and information regularities of the text under study.
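Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy list `word_forms` and the lookup table `LEMMA_OF` are hypothetical, and the reduction to main vocabulary units in the actual study was done on marked word forms rather than with a ready-made table.

```python
from collections import Counter

# Hypothetical toy input: word forms already tokenised and lower-cased.
word_forms = ["pumps", "pump", "system", "systems", "system", "valve", "pump"]

# Hypothetical lookup from a word form to its main vocabulary unit.
LEMMA_OF = {"pumps": "pump", "systems": "system"}

# Step 1: alphabetical list of all word forms with their counts.
form_counts = Counter(word_forms)
alphabetical_list = sorted(form_counts.items())

# Step 2: frequency list in descending order of frequency
# (ties are broken alphabetically so the order is deterministic).
frequency_list = sorted(form_counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Step 3: reduce word forms to main vocabulary units and sum their counts.
lemma_counts = Counter()
for form, count in form_counts.items():
    lemma_counts[LEMMA_OF.get(form, form)] += count
lemma_frequency_list = sorted(lemma_counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(alphabetical_list)
print(frequency_list)
print(lemma_frequency_list)
```

Keeping the word-form counts separate from the counts per vocabulary unit preserves both lists, so the alphabetical list of step 1 stays available after the forms are merged in step 3.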

In compiling the alphabetical ranking list, a marking method was used that makes it possible to distinguish between grammatical and lexical-grammatical homographs. The marking system uses code indices written in the Latin alphabet. Word forms were distinguished at the level of word classes, for example: pump as a noun and pump as a verb; to as a preposition and to as a particle; finite and non-finite verb forms, such as worked as Past Indefinite and worked as Past Participle; and different functions of the verbs to have, to be, should, would.
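The effect of such marking on the counts can be illustrated as follows. The codes used here (N, V, PREP, PART, PAST, PP) are hypothetical placeholders, since the paper does not list its actual code indices, and the tokens are toy data.

```python
from collections import Counter

# Hypothetical marking codes in Latin letters (not the codes used in the study):
# N = noun, V = verb, PREP = preposition, PART = particle,
# PAST = Past Indefinite, PP = Past Participle.
marked_tokens = [
    ("pump", "N"), ("pump", "V"), ("pump", "N"),
    ("to", "PREP"), ("to", "PART"),
    ("worked", "PAST"), ("worked", "PP"), ("worked", "PP"),
]

# Counting (form, code) pairs keeps homographs apart.
marked_counts = Counter(marked_tokens)

# Counting bare forms merges them, which loses the distinction.
unmarked_counts = Counter(form for form, _ in marked_tokens)

for (form, code), count in sorted(marked_counts.items()):
    print(f"{form}_{code}: {count}")
```

Without the codes, the three occurrences of pump would be counted as one unit; with them, the noun and the verb get separate entries in the alphabetical list.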

Once the lists were obtained, each word form was recorded together with its marking. When all word forms had been entered into a computer card index, the different forms of each word combined, and their frequencies summed, the frequency word list was obtained in descending order of the absolute frequency of each unit, i.e. the probabilistic-statistical model of the specialty "Chemical Mechanical Engineering" was developed.

All the words were supplied with the following characteristics:

F  – absolute frequency;

F*  – absolute total frequency;

f  – relative frequency;

f* – relative total frequency.

As an example, we take the word system. Its statistical parameters were calculated as follows (we conditionally take 200,000 word usages as the sample size).

1. F, the number of usages of the word in the whole sample, was 811.

2. F* is the absolute frequency of the word added to the accumulated absolute frequencies of all preceding (more frequent) words: 811 + 69,841, i.e. 70,652.

3. f is the ratio of the absolute frequency of the word to the length of the entire sample: 811 : 200,000, i.e. 0.00405 (the value is determined up to the fifth decimal digit inclusive).

4. f* is the sum of all previous relative frequencies plus the relative frequency of the word, i.e. 0.00405 + 0.34921 = 0.35326.
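The calculation for the word system can be reproduced with the short sketch below. The function `word_parameters` is hypothetical. The rounding rule is also an assumption: 811 : 200,000 equals 0.004055 exactly, and the paper's five-digit figure matches truncation rather than rounding, so truncation is used here.

```python
from decimal import Decimal, ROUND_DOWN

QUANTUM = Decimal("0.00001")  # five decimal digits, as in the worked example

def word_parameters(F, prev_F_star, prev_f_star, sample_size):
    """Return (F*, f, f*) for one word of the ranked frequency list.

    prev_F_star and prev_f_star are the accumulated absolute and relative
    frequencies of all words ranked above the current one.
    """
    F_star = prev_F_star + F
    # Assumption: relative frequencies are truncated, not rounded.
    f = (Decimal(F) / Decimal(sample_size)).quantize(QUANTUM, rounding=ROUND_DOWN)
    f_star = Decimal(prev_f_star) + f
    return F_star, f, f_star

# Figures for the word "system" taken from the example above.
F_star, f, f_star = word_parameters(
    F=811, prev_F_star=69_841, prev_f_star="0.34921", sample_size=200_000
)
print(F_star, f, f_star)
```

Decimal arithmetic is used instead of floats so that the five-digit values add up exactly; the call returns 70652, 0.00405 and 0.35326, the figures given for the word system.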

The resulting frequency dictionary of the sublanguage "Chemical Mechanical Engineering" contains 6,589 different words, which cover about 95% of the text corpus (190,700 word usages). The remaining roughly five percent (9,300 word usages) consists of unregistered or non-relevant units: proper names, mathematical symbols, formulas, etc.
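The coverage figures can be checked with simple arithmetic. The sketch assumes that the two counts refer to the conditional sample of 200,000 word usages, which is their sum; the variable names are illustrative only.

```python
SAMPLE_SIZE = 200_000        # conditional sample size (assumption, see above)
COVERED_USAGES = 190_700     # word usages covered by the frequency dictionary
UNREGISTERED_USAGES = 9_300  # proper names, mathematical symbols, formulas, etc.
DICTIONARY_WORDS = 6_589     # different words in the frequency dictionary

# The two parts must add up to the whole sample.
assert COVERED_USAGES + UNREGISTERED_USAGES == SAMPLE_SIZE

coverage = COVERED_USAGES / SAMPLE_SIZE
mean_usages_per_word = COVERED_USAGES / DICTIONARY_WORDS

print(f"coverage: {coverage:.2%}")
print(f"mean usages per dictionary word: {mean_usages_per_word:.2f}")
```

The exact share is slightly above the rounded 95% quoted in the text, and each dictionary word occurs just under 29 times on average.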

It should be noted that the developed probabilistic-statistical model of the technical specialty "Chemical Mechanical Engineering" has made possible a large number of comparative studies in theoretical linguistics at the morphological [2, 3], syntactic (the area of so-called "small syntax") [6-8], and lexical [6, 7] levels, and its data have also been used for linguo-didactic purposes [5].

References

1. Andreev N. D. Statistical and combinatorial methods in theoretical and applied linguistics / N. D. Andreev. – L.: Nauka, 1967. – 404 p.

2. Nevreva M. N. Genesis of nominal suffix morphemes in scientific communication texts (on the material of the English sublanguages of Electrical Engineering, Chemical and Process Engineering, and Motor Industry) / M. N. Nevreva, L. E. Tsapenko, M. V. Tsinova // Odessa Linguistic Bulletin. – Odessa: NUOLA, 2014. – No. 4. – P. 332-335.

3. Nevreva M. N. The analysis of suffix morphemes in a low-frequency zone of probabilistic and statistical models of sublanguages of scientific communication / M. N. Nevreva, T. I. Borisenko, E. L. Kapinus // XI International Scientific and Practical Conference «Найновите постижения на европейската наука - 2015». – Vol. 9. Philological sciences. – Sofia, Republic of Bulgaria: «Бял ГРАД-БГ» ООД, 2015. – P. 59-63. (Reg. No. 197820).

4. Scherba L. V. Language system and speech activity / L. V. Scherba. – L.: Nauka, 1974. – 425 p.

5. Shapa L. N. Introducing the results of theoretical research to the process of training the translation of English texts of scientific communication in the technical high schools / L. N. Shapa, L. G. Dantsevich, G. F. Dyachenko // Materiály XI mezinárodní vědecko-praktická konference «Aktuální vymoženosti vědy – 2015». – Díl 5. Filologické vědy. Psychologie a sociologie. Politické vědy. Filosofie. Historie. Hudba a život. Tělovýchova a sport. – Praha: Publishing House «Education and Science» s.r.o. – Str. 3-10.

6. Tsinovaya M. V. Lexical Component of the Second Constituent of Model Verb Constriction in the Texts of Scientific-Technical Communication / M. V. Tsinovaya // Bulletin of Kharkiv National Karasin University. Series "Romanic and Germanic philology. Methods of teaching foreign languages". – Kharkiv: KNU, 2014. – No. 1103. – P. 155-159.

7. Tsinova M. V. The interaction between grammatical and lexical features of the constituents of modal constructions with the verb can (on the material of sublanguages of scientific-technical discourse) / M. V. Tsinova // Young scientist. – 2014. – No. 2 (17). – Pt. V. – P. 132-136. [ISSN 2304-5809]. (Kherson).

8. Tsinovaya M. V. Typology of models with must verb at the structural-semantic level (on the material of English sublanguages of scientific-technical discourse) / M. V. Tsinova // Scientific Bulletin of International Humanitarian University. Series "Philology". – Odesa: IHU, 2015. – No. 14. – P. 269-272.