Philological sciences / 7. Language, speech, speech communication
Cand. Sci. (Philology) Nevreva M.N., Duvanskaya I.F., Kudinova T.I.
Odessa National Polytechnic University
PROBABILISTIC-STATISTICAL MODEL AND
THE STAGES OF ITS CREATION (ON THE MATERIAL OF THE ENGLISH TEXT CORPUS “CHEMICAL
ENGINEERING”)
In every field of human activity there is a constant selection of linguistic means to ensure the implementation of information-communicative tasks, and this, in turn, causes the differentiation of the national language into a number of functional styles: scientific, fiction, journalistic, and others.
In studying the functional language features of any of these styles, it is expedient to single out separate sections from the general discourse and to explore the totality of the language resources united by the subject matter of the given communication area. The totality of units at all levels of the language system structure constitutes a model of a particular subject area.
The most effective method for selecting the lexical material for linguistic research is to extract the object of analysis from text corpora of the studied domain, i.e. from the general totality, which is currently represented by special electronic text corpora of national languages such as the British National Corpus, the American National Corpus or the Cambridge International Corpus.
This technique, which was adhered to by L.V. Scherba [4], requires surveying a large text sample, on the basis of which researchers are able to obtain valid and reliable results.
To create an adequate probabilistic-statistical vocabulary model of a sublanguage, the corpus must meet the following requirements:
1. Strict attribution of texts to a specific field of communication.
2. Chronological limitation of the text sample material.
3. Completeness of texts, regardless of their length in word usages.
In forming models, the sample is quite often drawn not from complete texts but from extracts containing a fixed number of word forms, e.g. 1,000 to 10,000 units [1, p.62]. With this approach, however, much of the significant information on the composition of the sublanguage lexicon and the stratification of lexical groups is lost.
4. Sufficient size of the sample from the electronic text corpus to obtain statistically reliable source material.
To determine the object of research, the method of overall survey is generally used. Since each subject area is quite complex, it is usually inaccessible to direct observation and copying. Therefore its analog (model) is compiled, i.e. the semantic space, which is a certain field of discourse reflecting a portion of objective reality.
The formation of the semantic space is the first stage in creating a probabilistic-statistical model (frequency dictionary) of a particular area of communication. This stage involves determining the nomenclature of sub-themes of the semantic space and their shares. It is followed by quantitative and statistical calculations that identify the units of the prospective frequency list.
The present work is devoted to the lexicographic study of the scientific functional style. The object of study is the electronic text corpus of the technical specialty "Chemical Mechanical Engineering", which exhibits the probabilistic and statistical features typical of this style. The process of creating its model is described as well.
The novelty of the work lies in the fact that such a model has been compiled for the first time: the available literature makes no mention of a study that uses exactly this branch of engineering as its research material.
The textual materials that formed the basis of objective data on the technical specialty "Chemical Mechanical Engineering" were drawn from English and American magazines covering a ten-year period: Chemical Engineering, Chemical Processing, Chemical Engineering Progress, etc. These magazines are part of the national corpora, the British National Corpus and the American National Corpus, which contain not only oral speech subcorpora but also text sets relating to scientific communication.
An overall survey of the selected part of the electronic text corpus "Chemical Mechanical Engineering" was carried out. The units considered included all autosemantic and syntactic words, numerals written in words, and conventional abbreviations. Proper names, mathematical symbols, formulae and foreign insertions were not taken into consideration.
The semantic space, which simulates the semantics of this field of scientific communication, was formed on the basis of a review by experts and specialists in the field of technical knowledge "Chemical Mechanical Engineering".
The sub-themes of the semantic space of the sublanguage "Chemical Mechanical Engineering" and their percentage shares are given below:
1. Processes and apparatus of chemical technology – 30%
2. Chemical engineering machinery design – 35%
3. Corrosion of chemical equipment – 25%
4. General chemical technology – 10%
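As a rough illustration, the shares above can be turned into word-usage quotas per sub-theme. The sketch below rests on two assumptions that are not stated by the authors: that the percentages apply to the volume of the text sample, and that the sample holds 200,000 word usages (the conventional figure adopted in the worked example of this paper). The function name `allocate` is hypothetical.

```python
# Sub-theme shares as listed above (percent of the semantic space).
SHARES = {
    "Processes and apparatus of chemical technology": 30,
    "Chemical engineering machinery design": 35,
    "Corrosion of chemical equipment": 25,
    "General chemical technology": 10,
}


def allocate(sample_size: int, shares: dict[str, int]) -> dict[str, int]:
    """Split a sample of word usages among sub-themes by percentage share."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100%")
    return {theme: sample_size * pct // 100 for theme, pct in shares.items()}


# Assumption: a conventional sample of 200,000 word usages.
quotas = allocate(200_000, SHARES)
for theme, quota in quotas.items():
    print(f"{theme}: {quota} word usages")
```

Under these assumptions the largest sub-theme, machinery design, would receive 70,000 word usages and the smallest, general chemical technology, 20,000.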
The next stage is dedicated to creating the actual probabilistic-statistical model, i.e. the frequency dictionary of the "Chemical Mechanical Engineering" domain. It consists of several steps:
1. Compiling the alphabetical ranking list of all word forms of the text.
2. Drawing up a frequency list in which all word forms are arranged in descending order of frequency.
3. Reducing all word forms of the obtained frequency list to the main vocabulary units.
4. Calculating the statistical parameters of each word and identifying some common linguistic and informational regularities of the text under study.
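The four steps can be sketched on a toy example. This is only an illustration of the general procedure, not the authors' software: the sample sentence and the `LEMMAS` table are hypothetical, and a real study would need a full lemmatization resource for step 3.

```python
from collections import Counter

# Hypothetical toy text and lemma table; neither comes from the corpus.
tokens = "the pump works and the pumps worked while the system works".split()
LEMMAS = {"pumps": "pump", "works": "work", "worked": "work"}

# Step 1: alphabetical list of all word forms with their counts.
form_counts = Counter(tokens)
alphabetical = sorted(form_counts.items())

# Step 2: frequency list in descending order (ties broken alphabetically).
by_frequency = sorted(form_counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Step 3: reduce word forms to main vocabulary units and merge their counts.
word_counts = Counter()
for form, count in form_counts.items():
    word_counts[LEMMAS.get(form, form)] += count

# Step 4: one statistical parameter, the relative frequency of each word.
total = sum(word_counts.values())
relative = {word: count / total for word, count in word_counts.items()}

for word, count in word_counts.most_common():
    print(f"{word}\t{count}\t{relative[word]:.5f}")
```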
In compiling the alphabetical ranking list, a marking method was used that makes it possible to distinguish between grammatical and lexical-grammatical homographs. The marking system uses index codes written in the Latin alphabet. Word forms were distinguished at the level of word classes, for example: pump as a noun and pump as a verb; to as a preposition and to as a particle; finite and non-finite verb forms, such as worked as Past Indefinite and worked as Past Participle; and the different functions of the verbs to have, to be, should, would.
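The effect of such marking on the counts can be shown with a small sketch. The codes used here (N, V, PREP, PART, PI, PP) are illustrative placeholders and not the authors' actual indices, and the annotated tokens are invented for the example.

```python
from collections import Counter

# Hypothetical hand-annotated tokens: (word form, marking code).
# The codes are illustrative only, not the indices used in the study.
annotated = [
    ("pump", "N"), ("pump", "V"), ("pump", "N"),
    ("to", "PREP"), ("to", "PART"),
    ("worked", "PI"), ("worked", "PP"), ("worked", "PP"),
]

# Counting (form, code) pairs keeps homographs apart in the alphabetical list.
marked_counts = Counter(annotated)
for (form, code), count in sorted(marked_counts.items()):
    print(f"{form}/{code}\t{count}")

# Without the codes, the homographs collapse into one entry per form.
unmarked_counts = Counter(form for form, _ in annotated)
```

With marking, the three spellings yield six separate entries; without it, only three.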
Once the lists were obtained, each word form was recorded together with its marking code. When all word forms had been entered into a computer card index, the different forms of each word combined and their frequencies summed, the frequency word list was obtained in descending order of the absolute frequency of each unit, i.e. the probabilistic-statistical model of the specialty "Chemical Mechanical Engineering" was developed.
All the words were supplied with the following characteristics:
F – absolute frequency;
F* – absolute total (cumulative) frequency;
f – relative frequency;
f* – relative total (cumulative) frequency.
As an example, we take the word system. Its statistical parameters were calculated as follows (we conditionally accept 200,000 word usages as the sample size).
1. F, the number of usages of the word in the whole sample, was 811.
2. F* is the sum of the absolute frequencies accumulated down to and including the word: 811 + 69,841 = 70,652.
3. f is the ratio of the absolute frequency of the word to the length of the entire sample: 811 : 200,000 = 0.00405; the value is determined up to the fifth decimal digit inclusive.
4. f* is the sum of all previous relative frequencies plus the relative frequency of the word: 0.00405 + 0.34921 = 0.35326.
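The calculation for system can be reproduced with a short sketch. The function name `frequency_table` is hypothetical, and all words ranked above system are lumped into a single placeholder row carrying their combined frequency of 69,841, since the individual entries are not given in the text. Note that the sketch derives f* as F* divided by the sample size rather than by summing rounded relative frequencies; the two agree here, and the exact ratio 811 / 200,000 is 0.004055, which the text reports as 0.00405.

```python
from itertools import accumulate


def frequency_table(freq_list, sample_size):
    """Return rows (word, F, F*, f, f*) for a list sorted by descending F."""
    absolute = [count for _, count in freq_list]
    cumulative = list(accumulate(absolute))  # running sum gives F*
    return [
        (word, f_abs, f_cum, f_abs / sample_size, f_cum / sample_size)
        for (word, f_abs), f_cum in zip(freq_list, cumulative)
    ]


SAMPLE_SIZE = 200_000  # conventional sample size from the worked example

# Placeholder row: combined frequency of all words ranked above "system".
freq_list = [("<higher-ranked words>", 69_841), ("system", 811)]

rows = frequency_table(freq_list, SAMPLE_SIZE)
word, F, F_star, f, f_star = rows[-1]
print(word, F, F_star, f, f_star)
```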
The resulting frequency dictionary of the sublanguage "Chemical Mechanical Engineering" contains 6,589 different words, which cover approximately 95% of the text corpus (190,700 word usages). The remaining part of the text corpus, approximately five percent (9,300 word usages), consists of unregistered or irrelevant units: proper names, mathematical symbols, formulas, etc.
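The coverage figures can be checked with simple arithmetic. The sketch assumes the conventional sample size of 200,000 word usages adopted in the worked example; with the word-usage counts quoted above, the exact shares come out at 95.35% and 4.65%, which the text rounds to 95% and five percent.

```python
SAMPLE_SIZE = 200_000   # assumed conventional sample size
registered = 190_700    # word usages covered by the 6,589 dictionary words
unregistered = 9_300    # proper names, mathematical symbols, formulas, etc.

# The two parts should account for the whole sample.
assert registered + unregistered == SAMPLE_SIZE

coverage = registered / SAMPLE_SIZE
residue = unregistered / SAMPLE_SIZE
print(f"coverage {coverage:.2%}, residue {residue:.2%}")
```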
It should be noted that the developed probabilistic-statistical model of the technical specialty "Chemical Mechanical Engineering" has enabled a large number of comparative studies in theoretical linguistics at the morphological [2, 3], syntactic (the area of so-called "small syntax") [6 - 8] and lexical [6, 7] levels, and its data have also been used for linguo-didactic purposes [5].
References
1. Andreev N. D. Statistical and combinatorial methods in theoretical and applied linguistics / N. D. Andreev. – L.: Nauka, 1967. – 404 p.
2. Nevreva M. N. Genesis of nominal suffix morphemes in scientific communication texts (on the material of the English sublanguages of Electrical Engineering, Chemical and Process Engineering, and Motor Industry) / M. N. Nevreva, L. E. Tsapenko, M. V. Tsinova // Odessa Linguistic Bulletin. – Odessa: NUOLA, 2014. – № 4. – P. 332-335.
3. Nevreva M. N. The analysis of suffix morphemes in a low-frequency zone of probabilistic and statistical models of sublanguages of scientific communication / M. N. Nevreva, T. I. Borisenko, E. L. Kapinus // XI международна научна практична конференция «Найновите постижения на европейската наука - 2015». – Том 9. – Филологични науки. – Република България, гр. София: «Бял ГРАД-БГ» ООД. – 2015. – С. 59-63. (рег. № 197820).
4. Scherba L. V. Language system and speech activity / L. V. Scherba. – L.: Nauka, 1974. – 425 p.
5. Shapa L. N. Introducing the results of theoretical research to the process of training the translation of English texts of scientific communication in the technical high schools / L. N. Shapa, L. G. Dantsevich, G. F. Dyachenko // Materiály XI mezinárodní vědecko-praktická konference «Aktuální vymoženosti vědy – 2015». – Díl 5. Filologické vědy. Psychologie a sociologie. Politické vědy. Filosofie. Historie. Hudba a život. Tělovýchova a sport. – Praha: Publishing House «Education and Science» s.r.o. – str. 3-10.
6. Tsinovaya M. V. Lexical Component of the Second Constituent of Modal Verb Construction in the Texts of Scientific-Technical Communication / M. V. Tsinovaya // Bulletin of Kharkiv National Karazin University. Series "Romanic and Germanic philology. Methods of teaching foreign languages". – Kharkiv: KNU. – 2014. – № 1103. – P. 155-159.
7. Tsinova M. V. The interaction between grammatical and lexical features of the constituents of modal constructions with the verb can (on the material of sublanguages of scientific-technical discourse) / M. V. Tsinova // Young scientist. – 2014. – № 2 (17). – Pt. V. – P. 132-136. [ISSN 2304-5809]. (Kherson).
8. Tsinovaya M. V. Typology of models with must verb at the structural-semantic level (on the material of English sublanguages of scientific-technical discourse) / M. V. Tsinova // Scientific Bulletin of International Humanitarian University. Series "Philology". – Odesa: IHU. – № 14. – 2015. – P. 269-272.