KeNet: a comprehensive Turkish wordNet and its applications in text clustering

Ehsani, Razieh

dc.contributor.advisor	Yıldız, Olcay Taner	en_US
dc.contributor.author	Ehsani, Razieh	en_US
dc.contributor.other	Işık Üniversitesi, Fen Bilimleri Enstitüsü, Bilgisayar Mühendisliği Doktora Programı	en_US
dc.date.accessioned	2018-11-23T03:02:57Z
dc.date.available	2018-11-23T03:02:57Z
dc.date.issued	2018-06-07
dc.department	Işık Üniversitesi, Fen Bilimleri Enstitüsü, Bilgisayar Mühendisliği Doktora Programı	en_US
dc.description	Text in English ; Abstract: English and Turkish	en_US
dc.description	Includes bibliographical references (leaves 72-80)	en_US
dc.description	xiv, 80 leaves	en_US
dc.description.abstract	In this thesis, we summarize the methodology and the results of our efforts to construct a comprehensive WordNet for Turkish. Most languages have access to comprehensive language resources. Traditional resources like bilingual dictionaries, monolingual dictionaries, thesauri and lexicons are developed by lexicographers. As computer processing of languages gains popularity, a new set of resources becomes necessary. One such resource is WordNet, which was initially constructed for English at Princeton University. A WordNet contains much of the information contained in a classic dictionary, but it also contains additional relationship information. These relations go beyond synonymy and give information about relations such as a word being “is-a” or “is-a-part-of” another. These semantic relations are used in many text analysis tasks. A WordNet also groups words under common concepts. These concepts are called synsets. As a result, WordNet is a comprehensive machine-readable dictionary and a useful language resource for text analysis and other research based on human language. Our WordNet is not the first for Turkish. The previous WordNet is part of the BalkaNet project, a multilingual WordNet covering Turkish and Balkan languages. BalkaNet contains only the words common to these languages; as such, it does not contain all Turkish words and suffers from the disadvantages of the top-down construction method. The BalkaNet project has not been updated or expanded in recent years. In this work we construct a Turkish WordNet from scratch using a bottom-up method. In general there are two methods for constructing WordNets. The bottom-up method creates the WordNet from scratch, while the top-down approach translates existing WordNets. We use the Turkish Contemporary Dictionary (CDT), an online Turkish dictionary provided by the Turkish Language Institute. 
The bottom-up approach has its own difficulties, since constructing a WordNet from scratch requires more resources and a lot of effort. In this work, we extract synonyms from CDT and ask experts to match common meanings for pairs of synonyms. We developed an application which makes the annotation step easier and more accurate. We also use two groups of annotators to measure inter-annotator agreement. We used automatic approaches to extract semantic relations from Turkish Wikipedia (Vikipedi) and Vikisözlük. We processed CDT to extract candidate synonyms and used rule-based approaches to find synonym sets. There is no thesaurus for Turkish, so as an application we constructed a thesaurus automatically and measured its accuracy against our manually constructed synsets. We named our WordNet “KeNet”. Finally, in this thesis we developed a novel approach to represent a text document in a vector space. This approach uses WordNet semantic relations and is an application of KeNet. We used our approach to represent text documents and implemented two different clustering algorithms over these vectors. We tested our method on Turkish Wikipedia articles, whose domains are labeled by Wikipedia.	en_US
dc.description.abstract	This thesis summarizes the stages and difficulties of building a comprehensive Turkish WordNet and, finally, its application in an area of natural language processing. Every language has its own language resources; for example, monolingual dictionaries, bilingual dictionaries and lexicons are classic language resources developed by linguists. These resources are generally supported and supervised by a language institution. As computers have entered every area of our lives, it has become necessary to develop language resources that are machine-readable and usable in computer applications. One such machine-readable resource is WordNet, which was first developed for English at Princeton University. WordNet carries the features of classic dictionaries but also contains certain semantic relations between words. Beyond synonymy, these include relations such as one word being a kind of another, or one word being a part of another. These semantic relations are used in text analysis. WordNet groups words into a single set according to the real-world concept they denote; these sets are called synsets. As a result, WordNet is a comprehensive, machine-readable language resource that is very useful in text analysis. Before our work, a non-comprehensive WordNet had been developed for Turkish. This WordNet was developed under the BalkaNet project. BalkaNet is a multilingual WordNet covering the Balkan languages and Turkish. BalkaNet was developed in stages and semantic relations were added to it, but it has not been updated in recent years. This work describes the construction of a WordNet for Turkish from scratch. In general there are two methods for WordNet construction: the bottom-up method and the top-down method. 
The bottom-up method builds a WordNet from scratch using a dictionary, without translating or using any other WordNet, whereas the top-down method, instead of starting from scratch, builds a WordNet by directly translating WordNets that exist in other languages and then either extending them or leaving them unchanged. Our work constructs a WordNet with the bottom-up method using the Contemporary Turkish Dictionary (Güncel Türkçe Sözlük) of the Turkish Language Institute (Türk Dil Kurumu, TDK). In this work, we extracted synonymous words from the TDK dictionary and asked a group of people to mark the senses that these words share. Using software we developed for this annotation, we made the process easier and reduced the error rate. In addition, since no thesaurus existed for Turkish, we automatically constructed the first Turkish thesaurus. We measured the agreement between annotators and also evaluated the automatically constructed thesaurus against the manually annotated synonym sets. Finally, we used the WordNet developed in this work to cluster Vikipedi articles. For this, we first converted each text document into a vector, using our own method.	en_US
dc.description.sponsorship	This study was supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No: 116E104	en_US
dc.description.tableofcontents	Turkish language	en_US
dc.description.tableofcontents	WordNet	en_US
dc.description.tableofcontents	WordNets in Other Languages	en_US
dc.description.tableofcontents	Manual WordNet construction	en_US
dc.description.tableofcontents	Lexical resource	en_US
dc.description.tableofcontents	Sense granularity	en_US
dc.description.tableofcontents	Productive derivations	en_US
dc.description.tableofcontents	Processing the Dictionary	en_US
dc.description.tableofcontents	Synonym candidates	en_US
dc.description.tableofcontents	Handling MWEs	en_US
dc.description.tableofcontents	Manual Annotation	en_US
dc.description.tableofcontents	Special Cases	en_US
dc.description.tableofcontents	Inter-annotator agreement	en_US
dc.description.tableofcontents	Synset construction	en_US
dc.description.tableofcontents	Synset statistics	en_US
dc.description.tableofcontents	Semantic relations	en_US
dc.description.tableofcontents	Antonyms	en_US
dc.description.tableofcontents	Hypernyms and hyponyms	en_US
dc.description.tableofcontents	Hypernym-hyponym in CDT	en_US
dc.description.tableofcontents	Hypernym-hyponym in Vikipedi and Vikisözlük	en_US
dc.description.tableofcontents	Domain	en_US
dc.description.tableofcontents	Automatic WordNet Construction	en_US
dc.description.tableofcontents	Automatic thesaurus	en_US
dc.description.tableofcontents	Comparison of Synsets	en_US
dc.description.tableofcontents	Related work on clustering text	en_US
dc.description.tableofcontents	Semantic Similarity	en_US
dc.description.tableofcontents	Topological similarity	en_US
dc.description.tableofcontents	Statistical similarity	en_US
dc.description.tableofcontents	Content based clustering	en_US
dc.description.tableofcontents	Textual graph	en_US
dc.description.tableofcontents	Preprocessing data	en_US
dc.description.tableofcontents	Morphological analysis	en_US
dc.description.tableofcontents	Morphological disambiguation	en_US
dc.description.tableofcontents	Convert words to the dictionary entries	en_US
dc.description.tableofcontents	Getting rid of redundant words	en_US
dc.description.tableofcontents	Constructing textual graph	en_US
dc.description.tableofcontents	Representing text	en_US
dc.description.tableofcontents	Disambiguating synsets	en_US
dc.description.tableofcontents	Representatives for synsets	en_US
dc.description.tableofcontents	Co-occurrence graph	en_US
dc.description.tableofcontents	Textual graph analysis	en_US
dc.description.tableofcontents	Jaccard Similarity	en_US
dc.description.tableofcontents	Generalized Jaccard similarity	en_US
dc.description.tableofcontents	PageRank	en_US
dc.description.tableofcontents	Experimental results for clustering headlines	en_US
dc.description.tableofcontents	Page2Vec algorithm	en_US
dc.description.tableofcontents	K-means clustering	en_US
dc.description.tableofcontents	Hierarchical clustering	en_US
dc.identifier.citation	Ehsani, R. (2018). KeNet: a comprehensive Turkish wordNet and its applications in text clustering. İstanbul: Işık Üniversitesi Fen Bilimleri Enstitüsü.	en_US
dc.identifier.uri	https://hdl.handle.net/11729/1392
dc.institutionauthor	Ehsani, Razieh	en_US
dc.language.iso	en	en_US
dc.publisher	Işık Üniversitesi	en_US
dc.relation.publicationcategory	Tez	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Graph-based	en_US
dc.subject	NLP	en_US
dc.subject	Semantic	en_US
dc.subject	Sense	en_US
dc.subject	Text analysis	en_US
dc.subject	Text clustering	en_US
dc.subject	Turkish NLP	en_US
dc.subject	WordNet	en_US
dc.subject	Anlam	en_US
dc.subject	Dil	en_US
dc.subject	Graph tabanlı çözümleme	en_US
dc.subject	Metin ayrıştırma	en_US
dc.subject	Türkçe	en_US
dc.subject	Türkçe doğal dil işleme	en_US
dc.subject	Yazı çözümleme	en_US
dc.subject.lcc	P98.45.T9 E37 2018
dc.subject.lcsh	Computational linguistics -- Turkey	en_US
dc.subject.lcsh	Text processing (Computer science)	en_US
dc.title	KeNet: a comprehensive Turkish wordNet and its applications in text clustering	en_US
dc.title.alternative	KeNet: kapsamlı Türkçe wordnet ve metin kümelemede kullanılması	en_US
dc.type	Doctoral Thesis	en_US
dspace.entity.type	Publication
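
The English abstract says that experts match common meanings for pairs of synonyms and that synsets are then constructed. The record does not state how the matched pairs are combined, so the following is only a minimal sketch under one assumption: that a match is treated as transitive, so synsets can be read off as connected components of the matched pairs. The function name `build_synsets`, the `(lemma, sense_number)` representation, and the sample words are hypothetical and not taken from the thesis.

```python
from collections import defaultdict


def build_synsets(matched_pairs):
    """Group senses into synsets from pairwise matches.

    Assumption (not from the thesis): matches are transitive, so each
    connected component of matched senses forms one synset.
    Each sense is a hypothetical (lemma, sense_number) tuple.
    """
    parent = {}

    def find(x):
        # Register unseen senses as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in matched_pairs:
        union(a, b)

    # Collect every sense under its component root.
    groups = defaultdict(set)
    for sense in list(parent):
        groups[find(sense)].add(sense)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Hypothetical annotator decisions: each pair shares a meaning.
    pairs = [
        (("ev", 1), ("konut", 1)),
        (("konut", 1), ("mesken", 1)),
        (("surat", 1), ("yüz", 1)),
    ]
    for synset in build_synsets(pairs):
        print(synset)
```

If annotators could disagree or if matches were not transitive in practice, a stricter rule (for example, requiring every pair inside a synset to be matched directly) would be needed instead of connected components.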

Files

Original bundle

Name: Razieh_Ehsani.pdf
Size: 2.01 MB
Format: Adobe Portable Document Format
Description: DoctoralThesis

Collection

Lisansüstü Eğitim Enstitüsü Tez Koleksiyonu (Graduate Education Institute Thesis Collection)