Philological Sciences / 3. Theoretical and Methodological Problems of Language
Research
Khrystyna Kulak
Lviv Polytechnic National University, Ukraine
Extracting semantic relations for Ukrainian WordNet
WordNet is a large lexical database of English, developed under the
direction of George A. Miller at the Cognitive Science Laboratory, Princeton
University [2]. The design of WordNet was inspired by psycholinguistic studies
of the mental lexicon; consequently, the database has been built around the
notion of a concept. In WordNet, the four major parts of speech form separate files,
containing synonymic sets (synsets), each representing one underlying lexical
concept, which are organized into hierarchical structures. Synsets are
interlinked by means of semantic and lexical relations.
Up to now, over 40 national wordnets have been developed, covering
numerous European, Asian, and African languages, as well as a number of
domain-specific wordnets. Building a WordNet-like dictionary for the Ukrainian
language would encourage the development of numerous fields within
computational linguistics. It would stimulate further linguistic
studies and enable Ukrainian scientists to enter the international NLP community.
WordNet is also a comprehensive lexical database in its own right: a useful
dictionary and a reliable lexicographic source of reference.
With regard to the technologies used to build a particular national
WordNet, or any other computer dictionary or thesaurus, two main approaches
can be singled out: computer-aided construction and manual construction [3].
Forming a WordNet through automatic acquisition is the faster and
less labour-intensive option. Crafting the lexical database by hand, on the contrary,
is much slower, more labour-intensive, and more expensive. However, manual
construction, provided that it is carried out by a team of competent
linguists, can ensure higher quality and accuracy, whereas computer-aided
construction requires manual checking and correction of the resulting data. In
the end, though, the choice of method does not rest on a pure comparison of
advantages and drawbacks; it is determined by the availability of the
corresponding lexical resources.
One of the initial steps in building a WordNet is the selection of
the so-called Base Concepts, i.e. the words that form the core of the lexicon
of the target language. Several methods have been applied to this task.
The Base Concepts represent the most frequent words of the language under study
and can be extracted either from corpora or from explanatory dictionaries [7].
Lately, another technique, the ‘free association test’ (FAT), has been
extensively employed [4]. This is a psycholinguistic approach: a list of
words (stimuli) is presented to the respondents, either in writing or orally,
and they are asked to name the first word that comes to mind (the response).
Compared with other, more sophisticated forms of association experiments (e.g.
the controlled association test or priming), FAT gives the broadest information
on the way knowledge is structured in the human mind [4].
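As a rough illustration of the frequency-based route mentioned above (not the procedure used in [1] or [7]), the most frequent word forms of a plain-text corpus could be counted as follows. The function name and the tokenisation rule are assumptions made for this sketch; no lemmatisation is applied, so inflected forms of the same word are counted separately.

```python
import re
from collections import Counter


def most_frequent_words(text, n):
    """Return the n most frequent word forms in a plain-text corpus.

    Assumption: a "word" is any run of letters. Words written with an
    apostrophe are split, and inflected forms are not merged.
    """
    # [^\W\d_]+ matches runs of letters, including Cyrillic ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word, _ in Counter(words).most_common(n)]
```

For an inflecting language such as Ukrainian, a real selection step would need lemmatisation before counting; the sketch only shows the counting itself.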
The Base Concepts for the Ukrainian language were defined at Lviv
Polytechnic National University in 2009 by Khrystyna Khariv in her Master's
thesis research, a first attempt to create a Ukrainian WordNet [1]. These
concepts, nouns only, were also used for building the Ukrainian WordNet in the
present research.
Two Ukrainian online dictionaries were used in this research to build the
semantic relations: The Comprehensive Dictionary of the Contemporary Ukrainian
Language [5] and the Dictionary of Synonyms [6], both of which are also
available in print.
The initial data set (the Base Concepts) consisted of 598 units, of which 479 were present in The
Comprehensive Dictionary and only 254 were included in the Dictionary of
Synonyms. The dictionary entries were extracted from the online dictionaries
by means of scripts written in Python. After analysis
of the entries, the most frequent patterns were defined. There are 19 of them:
"Той, хто" (one who), "Дія за знач." (action by the meaning of),
"Дія і стан за знач." (action and state by the meaning of),
"Те саме, що" (the same as), "Частина" (part), "Місцевість" (locality),
"Сукупність" (totality), "Сторона" (side), "Напрям" (direction),
"Діяльність" (activity), "Явище" (phenomenon), "Факт" (fact),
"Категорія" (category), "Думка" (thought), "Відстань" (distance),
"Простір" (space), "Те, що" (that which), "Пор." (cf.), "Місце" (place),
"Дія" (action).
Then the lines expressing semantic relations were extracted from the
dictionary entries, yielding 510 semantic relations: hyponymy (178),
synonymy (100), holonymy (58), sister term (51), action (49), meronymy (45),
related (21), action-and-condition (8).
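One way the lines carrying a pattern could be located in an extracted entry is sketched below. This is an illustration, not the scripts used in this research: it assumes that a pattern opens a definition line, optionally after a sense number, and the function names are hypothetical.

```python
import re

# Definition patterns as listed in the text.
PATTERNS = [
    "Той, хто", "Дія за знач.", "Дія і стан за знач.", "Те саме, що",
    "Частина", "Місцевість", "Сукупність", "Сторона", "Напрям",
    "Діяльність", "Явище", "Факт", "Категорія", "Думка", "Відстань",
    "Простір", "Те, що", "Пор.", "Місце", "Дія",
]

# Longest patterns first, so that "Дія за знач." and "Діяльність" are tried
# before the bare "Дія", and "Місцевість" before "Місце".
_BY_LENGTH = sorted(PATTERNS, key=len, reverse=True)

# Assumption: a sense number such as "2." or "2)" may precede the definition.
_SENSE_PREFIX = re.compile(r"^\s*\d+[.)]\s*")


def find_pattern(line):
    """Return the pattern that opens a definition line, or None."""
    text = _SENSE_PREFIX.sub("", line).strip()
    for pattern in _BY_LENGTH:
        if text.startswith(pattern):
            return pattern
    return None


def lines_with_patterns(entry_text):
    """Yield (pattern, line) for each line of an entry that opens with a pattern."""
    for line in entry_text.splitlines():
        pattern = find_pattern(line)
        if pattern is not None:
            yield pattern, line.strip()
```

Trying the longest pattern first matters because several patterns are prefixes of others. Plain prefix matching of this kind will also pick up lines that do not express a real relation, which is why the output still needs manual checking.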
The following examples show how the semantic relations are represented:
<автор>_related(написати);
<вік 2>_synonym(сторіччя);
<закон 1>_hyponym(сукупність);
<група 3>_holonym(особа);
<удар>_sister(ляпас);
<сподівання 1>_action-and-condition(сподіватися);
<прохід 1>_action(проходити 1, 3, 6);
<вечір 1>_meronym(доба).
Numbers after the concept identify which sense of the
word has the relation.
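The notation above is regular enough to be read back by a program. The following sketch parses one relation line into its parts; the names `Relation` and `parse_relation` are hypothetical, and treating the numbers inside the parentheses as sense numbers of the target word is an assumption, since the text only explains the number after the concept.

```python
import re
from typing import NamedTuple, Optional, Tuple

# <concept [sense]>_relation(target [sense, sense, ...]);
RELATION_RE = re.compile(
    r"^<(?P<concept>[^<>\d]+?)\s*(?P<sense>\d+)?>"
    r"_(?P<relation>[a-z-]+)"
    r"\((?P<target>[^()\d]+?)\s*(?P<target_senses>\d+(?:\s*,\s*\d+)*)?\)"
    r"[;.]?$"
)


class Relation(NamedTuple):
    concept: str
    sense: Optional[int]            # None when no sense number is given
    relation: str
    target: str
    target_senses: Tuple[int, ...]  # assumed to be senses of the target word


def parse_relation(line: str) -> Relation:
    """Parse a line such as '<вік 2>_synonym(сторіччя);'."""
    match = RELATION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a relation line: {line!r}")
    sense = match.group("sense")
    senses = match.group("target_senses")
    return Relation(
        concept=match.group("concept"),
        sense=int(sense) if sense else None,
        relation=match.group("relation"),
        target=match.group("target"),
        target_senses=tuple(int(n) for n in senses.split(",")) if senses else (),
    )
```

For example, `parse_relation("<прохід 1>_action(проходити 1, 3, 6);")` gives the concept `прохід` in sense 1, the relation `action`, and the target `проходити` with the numbers 1, 3 and 6.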
Some problems were encountered during and after the research:
- the morphological complexity of Ukrainian does not allow fully automatic
extraction of the semantic relations, so a significant part of the work was
done manually;
- false relations, i.e. relations that do not exist in the Ukrainian language,
were produced by the automatic extraction of the lines containing the patterns
from the dictionary entries.
Nevertheless, 510 relations from 733 dictionary entries is a satisfactory
result. Plans for the future include building semantic relations with the help
of a POS tagger.
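The totals quoted in the text can be cross-checked with a few lines of arithmetic. The variable and function names below are introduced only for this sketch; the figures are those given above.

```python
# Figures quoted in the text.
BASE_CONCEPTS = 598
ENTRIES_FOUND = {
    "Comprehensive Dictionary": 479,
    "Dictionary of Synonyms": 254,
}
RELATION_COUNTS = {
    "hyponymy": 178, "synonymy": 100, "holonymy": 58, "sister term": 51,
    "action": 49, "meronymy": 45, "related": 21, "action-and-condition": 8,
}

total_entries = sum(ENTRIES_FOUND.values())      # 733 dictionary entries
total_relations = sum(RELATION_COUNTS.values())  # 510 semantic relations


def share(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


for name, found in ENTRIES_FOUND.items():
    print(f"{name}: {found} of {BASE_CONCEPTS} base concepts "
          f"({share(found, BASE_CONCEPTS)}%)")
for name, count in RELATION_COUNTS.items():
    print(f"{name}: {count} ({share(count, total_relations)}% of relations)")
```

The two dictionary counts add up to the 733 entries, and the eight relation counts add up to the 510 relations reported.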
References:
1. Khariv K. Implementation of hypernimic, hyponimic and antonymic relations for building
a Ukrainian WordNet-like dictionary : Master Thesis / Khariv Khrystyna. – Lviv,
2009. – 104 p.
2. Princeton WordNet, official site. – Available at: www.wordnet.princeton.edu.
3. Ramanand J., Ukey A., Kiran B.S., Bhattacharyya P. Mapping and
Structural Analysis of Multi-lingual Wordnets // Data Engineering. – 2007. – Vol.
30, No. 1. – P. 30-43. – [Cited 2009, 3 November]. – Available at: <http://www.it.iitb.ac.in/~ramanand/mtp/IEEE_DEB_Mar2007_Article_JRamanand.pdf>.
4. Romanian WordNet: New Developments and Applications / D. Tufis, V. B.
Mititelu, L. Bozianu, et al. // Proceedings of the Third International WordNet
Conference. – Jeju Island, Korea, 2006. – P. 337-344. – [Cited 2009, 16 July].
– Available at: <http://nlpweb.kaist.ac.kr/gwc/pdf2006/gwc06.pdf>.
5. Busel V. The Comprehensive Dictionary of the Contemporary Ukrainian Language
[Electronic resource] / Busel Viacheslav // Perun. – 2005. – Available at: http://www.lingvo-online.ru/en/LingvoDictionaries/Details?dictionary=Explanatory%20(Uk-Uk).
6. Zubkov M. Dictionary of Synonyms [Electronic resource] / Mykola Zubkov // Vesna. –
2007. – Available at: http://www.lingvo-online.ru/en/LingvoDictionaries/Details?dictionary=Synonyms%20(Uk-Uk).
7. Частотний словник української художньої прози [Frequency Dictionary of
Ukrainian Fiction Prose]: in 2 vols. / comp. N. P. Darchuk. – Kyiv: Naukova
Dumka, 1981. – 855 p.