Philological Sciences / 3. Theoretical and methodological problems of language research

Khrystyna Kulak

Lviv Polytechnic National University, Ukraine

Extracting semantic relations for Ukrainian WordNet

WordNet is a large lexical database of English, developed under the direction of George A. Miller at the Cognitive Science Laboratory, Princeton University [2]. The design of WordNet was inspired by psycholinguistic studies of the mental lexicon; consequently, the database is built around the notion of a concept. In WordNet, the four major parts of speech are stored in separate files containing synonym sets (synsets), each representing one underlying lexical concept and organized into hierarchical structures. Synsets are interlinked by means of semantic and lexical relations.

Up to now, over 40 national wordnets have been developed for numerous European, Asian, and African languages, as well as a number of domain-specific wordnets. Building a WordNet-like dictionary for the Ukrainian language would encourage the development of many fields within computational linguistics. It would stimulate further linguistic studies and enable Ukrainian scientists to enter the international NLP community. Moreover, as a comprehensive lexical database, WordNet also serves as a useful dictionary and a reliable lexicographic reference.

With respect to the technologies used to build a national WordNet or any other computer dictionary or thesaurus, two main approaches can be singled out: computer-aided construction and manual construction [3]. Forming a WordNet through automatic acquisition is the faster and less labour-intensive of the two. Crafting the lexical database by hand, by contrast, is much slower, more labour-intensive, and more expensive. However, manual construction, provided that it is carried out by a team of competent linguists, can ensure higher quality and accuracy, whereas computer-aided construction requires manual checking and correction of the resulting data. In practice, though, the choice of method does not rest on a weighing of advantages and drawbacks alone; it depends on the availability of the corresponding lexical resources.

One of the initial steps in building a WordNet is the selection of the so-called Base Concepts, i.e. the words that form the core of the lexicon of the target language. Several methods have been applied to perform this task. The Base Concepts represent the most frequent words of the language under study, and they can be extracted either from corpora or from explanatory dictionaries [7]. Lately, another technique has been extensively employed: the ‘free association test’ (FAT) [4]. This is a psycholinguistic approach. Generally, a list of words (stimuli) is presented to the respondents, either in writing or orally, and they are asked to name the first word that comes to mind (the response). As opposed to other, more sophisticated forms of association experiments (e.g. controlled association test, priming, etc.), FAT gives the broadest information on the way knowledge is structured in the human mind [4].

The Base Concepts for the Ukrainian language were defined at Lviv Polytechnic National University in 2009 by Khrystyna Khariv in her Master's thesis research, a first attempt to create a Ukrainian WordNet [1]. The same concepts, nouns only, were used to build the Ukrainian WordNet in the present research.

Two Ukrainian online dictionaries were used in this research to build the semantic relations: The Comprehensive Dictionary of the Contemporary Ukrainian Language [5] and the Dictionary of Synonyms [6], both of which are also available in print.

The initial data set (the Base Concepts) consisted of 598 units, of which 479 were present in The Comprehensive Dictionary and only 254 were included in the Dictionary of Synonyms. The dictionary entries were extracted from the online dictionaries with scripts written in Python. After analysis of the entries, the most frequent definition patterns were identified. The 20 patterns are listed here with English glosses: "Той, хто" (the one who), "Дія за знач." (action in the meaning of), "Дія і стан за знач." (action and state in the meaning of), "Те саме, що" (the same as), "Частина" (part), "Місцевість" (locality), "Сукупність" (totality), "Сторона" (side), "Напрям" (direction), "Діяльність" (activity), "Явище" (phenomenon), "Факт" (fact), "Категорія" (category), "Думка" (thought), "Відстань" (distance), "Простір" (space), "Те, що" (that which), "Пор." (cf.), "Місце" (place), "Дія" (action).
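The pattern-matching step can be illustrated with a minimal Python sketch. It is not the script used in the research: the function name `find_pattern`, the `ASSUMED_RELATION` mapping (guessed from the relation names and examples in this paper), and the sample definition strings are all hypothetical. It only shows how a definition that opens with one of the listed patterns might be detected.

```python
# Definition-opening patterns as listed in the paper.
PATTERNS = (
    "Той, хто", "Дія за знач.", "Дія і стан за знач.", "Те саме, що",
    "Частина", "Місцевість", "Сукупність", "Сторона", "Напрям",
    "Діяльність", "Явище", "Факт", "Категорія", "Думка", "Відстань",
    "Простір", "Те, що", "Пор.", "Місце", "Дія",
)

# ASSUMPTION: an illustrative pattern-to-relation mapping; the paper does
# not state which pattern yields which relation.
ASSUMED_RELATION = {
    "Те саме, що": "synonym",
    "Дія за знач.": "action",
    "Дія і стан за знач.": "action-and-condition",
    "Частина": "meronym",
}


def find_pattern(definition):
    """Return (pattern, assumed relation or None, rest of the definition),
    or None when the definition opens with none of the patterns."""
    text = definition.strip()
    # Longest patterns first, so that "Місце" cannot shadow "Місцевість"
    # and "Дія" cannot shadow "Дія за знач.".
    for pattern in sorted(PATTERNS, key=len, reverse=True):
        if text.startswith(pattern):
            rest = text[len(pattern):].strip(" .;:,")
            return pattern, ASSUMED_RELATION.get(pattern), rest
    return None


# Made-up definition strings, for illustration only.
print(find_pattern("Те саме, що сторіччя."))
print(find_pattern("Місцевість біля річки"))
```

Matching only at the start of the definition is a design choice of this sketch; any pattern whose relation is not in the assumed mapping is returned with `None` so that it can be resolved by hand.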

Then the lines containing semantic relations were extracted from the dictionary entries, which yielded 510 semantic relations: hyponymy (178), synonymy (100), holonymy (58), sister term (51), action (49), meronymy (45), related (21), action-and-condition (8).
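As a quick arithmetic check, the reported counts can be tallied in a few lines of Python. The counts are those given above; the variable names are illustrative only.

```python
# Relation counts as reported in the paper.
relation_counts = {
    "hyponymy": 178,
    "synonymy": 100,
    "holonymy": 58,
    "sister term": 51,
    "action": 49,
    "meronymy": 45,
    "related": 21,
    "action-and-condition": 8,
}

total = sum(relation_counts.values())  # 510, matching the reported total

# Share of each relation type, as a percentage of all extracted relations.
shares = {
    name: round(100 * count / total, 1)
    for name, count in relation_counts.items()
}

for name, share in shares.items():
    print(f"{name}: {relation_counts[name]} ({share}%)")
```

The tally confirms that the eight categories sum to 510 and shows that hyponymy alone accounts for roughly a third of the extracted relations.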

The following examples show how the semantic relations are represented:

<автор>_related(написати);

<вік 2>_synonym(сторіччя);

<закон 1>_hyponym(сукупність);

<група 3>_holonym(особа);

<удар>_sister(ляпас);

<сподівання 1>_action-and-condition(сподіватися);

<прохід 1>_action(проходити 1, 3, 6);

<вечір 1>_meronym(доба);

Numbers after the concept identify which sense of the word has the relation.
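The notation used in the examples above can be read mechanically. The following Python sketch parses one relation line into a small record. The `Relation` class, the regular expression, and the reading of the numbers inside the parentheses as sense numbers of the target word are assumptions made for illustration, not part of the original toolchain.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Relation:
    concept: str            # source word, e.g. "вік"
    sense: Optional[int]    # sense number of the source word, if given
    relation: str           # e.g. "synonym", "action-and-condition"
    target: str             # related word
    target_senses: List[int] = field(default_factory=list)


# <concept [sense]>_relation(target [n, n, ...]);
LINE_RE = re.compile(
    r"^<(?P<concept>[^<>\d]+?)(?:\s+(?P<sense>\d+))?>"
    r"_(?P<relation>[\w-]+)"
    r"\((?P<target>[^()]+)\);?$"
)


def parse_relation(line):
    """Parse one relation line; raise ValueError if it does not fit."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a relation line: {line!r}")
    sense = match.group("sense")
    target = match.group("target").strip()
    # ASSUMPTION: trailing numbers are sense numbers of the target word.
    target_senses = [int(n) for n in re.findall(r"\d+", target)]
    target_word = re.sub(r"[\d,\s]+$", "", target)
    return Relation(
        concept=match.group("concept"),
        sense=int(sense) if sense else None,
        relation=match.group("relation"),
        target=target_word,
        target_senses=target_senses,
    )


print(parse_relation("<прохід 1>_action(проходити 1, 3, 6);"))
print(parse_relation("<автор>_related(написати);"))
```

Keeping the sense number separate from the word makes it possible to group relations by concept and sense later, for instance when assembling synsets.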

The following problems were encountered during and after the research:

- the morphological complexity of Ukrainian does not allow fully automatic extraction of the semantic relations, so a substantial part of the work was done manually;

- false relations, i.e. relations that do not exist in the Ukrainian language, were produced because the lines containing the patterns were extracted from the dictionary entries automatically.

Nevertheless, 510 relations out of 733 dictionary entries is a satisfactory result. Plans for the future include building semantic relations with the help of a part-of-speech (POS) tagger.

References:

1.     Khariv K. Implementation of hypernimic, hyponimic and antonymic relations for building a Ukrainian WordNet-like dictionary : Master Thesis / Khariv Khrystyna – Lviv, 2009. – 104 p.

2.     Princeton WordNet, official site. – Available at: www.wordnet.princeton.edu.

3.     Ramanand J., Ukey A., Kiran B.S., Bhattacharyya P. Mapping and Structural Analysis of Multi-lingual Wordnets // Data Engineering. – 2007. – Vol. 30, No. 1. – P. 30-43. – [Cited 2009, 3 November]. – Available at: <http://www.it.iitb.ac.in/~ramanand/mtp/IEEE_DEB_Mar2007_Article_JRamanand.pdf>.

4.     Romanian WordNet: New Developments and Applications / D. Tufis, V. B. Mititelu, L. Bozianu, et al. // Proceedings of the Third International WordNet Conference. – Jeju Island, Korea, 2006. – P. 337-344. – [Cited 2009, 16 July]. – Available at: <http://nlpweb.kaist.ac.kr/gwc/pdf2006/gwc06.pdf>.

5.     Viacheslav B. The Comprehensive Dictionary of the Contemporary Ukrainian Language [Electronic resource] / Busel Viacheslav // Perun. – 2005. – Available at: http://www.lingvo-online.ru/en/LingvoDictionaries/Details?dictionary=Explanatory%20(Uk-Uk).

6.     Zubkov M. Dictionary of Synonyms [Electronic resource] / Mykola Zubkov // Vesna. – 2007. – Available at: http://www.lingvo-online.ru/en/LingvoDictionaries/Details?dictionary=Synonyms%20(Uk-Uk).

7.     Частотний словник української художньої прози [Frequency Dictionary of Ukrainian Fiction Prose]: in 2 vols. / comp. N. P. Darchuk. – Kyiv: Naukova Dumka, 1981. – 855 p.