Işık Üniversitesi Kurumsal Akademik Belleği :: DSpace Angular

In this article, we summarize the methodology and the results of our 2-year-long efforts to construct a comprehensive WordNet for Turkish. In our approach, we mine a dictionary for synonym candidate pairs and manually mark the senses in which the candidates are synonymous. We marked every pair twice by different human annotators. We derive the synsets by finding the connected components of the graph whose edges are synonym senses. We also mined Turkish Wikipedia for hypernym relations among the senses. We analyzed the resulting WordNet to highlight the difficulties brought about by the dictionary construction methods of lexicographers. After splitting the unusually large synsets, we used random walk-based clustering that resulted in a Zipfian distribution of synset sizes. We compared our results to BalkaNet and automatic thesaurus construction methods using variation of information metric. Our Turkish WordNet is available online.

In this thesis, we summarize the methodology and the results of our e?orts to construct a comprehensive WordNet for Turkish. Most languages have access to comprehensive language resources. Traditional resources like bilingual dictionaries, monolingual dictionaries, thesauri and lexicons are developed by lexicographers. As computer processing of languages gain popularity, a new set of resources become necessary. One such resource is WordNet which was initially constructed for English language in Princeton University. A WordNet contains much of the information contained in a classic dictionary, but it also contains additional relationship information. These relations go beyond synonym relation and give information about relations such as a word being“is-a” or “is-a-part-of” another. These semantic relations are used in many text analysis tasks. A WordNet also categorizes words under common concepts. These concepts are called as synsets. As a result of all these, WordNet is a comprehensive dictionary which is readable by the computers and a useful language resource for text analysis and other research based on human language. In Turkish language, our WordNet is not the ?rst. The previous WordNet is part of BalkaNet project which is a multilingual WordNet including Turkish and Balkan languages. BalkaNet contains only common words between these languages, as such BalkaNet does not contain all Turkish words and su?ers from top-down constructing method disadvantages. BalkaNet project has not been updated or expanded in recent years. In this work we construct a Turkish WordNet from scratch using a bottom-up method. In general there are two methods for constructing WordNets. Bottomup method means that we create the WordNet from scratch while top-down approach uses other WordNets by translating them. We use Turkish Contemporary Dictionary (CDT) which is an online Turkish dictionary provided by Turkish Language Institute. Bottom-up approach has its own di?culties, since constructing a WordNet from scratch requires more resources and a lot of e?ort. In this work, we extract synonyms from CDT and ask experts to match common meanings for pairs of synonyms. We developed an application which makes annotation step easier and more accurate. We also use two groups of annotators to measure inter-annotator agreement. We used some automatic approaches to extract semantic relations from Turkish Wikipedia (Vikipedi) and Vikisözlük. We processed CDT to extract candidate synonyms and used rule based approaches to ?nd synonym sets. There is no thesaurus for Turkish, so as an application we construct a thesaurus automatically and measured accuracy with our manually constructed synsets. We named our WordNet “KeNet”. Finally, in this thesis we developed a novel approach to represent a text document in a vector space. This approach uses WordNet semantic relations. This part of thesis is an application of KeNet. We used our approach to represent text documents and implemented two di?erent clustering algorithms over these vectors. We tested our method over Turkish Wikipedia articles, domains of which are labeled by Wikipedia.

Filtreler

Yazar

Konu

Tarih

İndeks

WoS Q

Scopus Q

Dil

Tür

Kategori

Bölüm

Erişim Hakkı

Tam Metin

Öğe Türü

Ayarlar

Sırala

Sayfa Başına Sonuç

Arama Sonuçları