Search Results

Showing 1 - 6 of 6
  • Publication
    A new approach for named entity recognition
    (IEEE, 2017) Ertopçu, Burak; Kanburoğlu, Ali Buğra; Topsakal, Ozan; Açıkgöz, Onur; Gürkan, Ali Tunca; Özenç, Berke; Çam, İlker; Avar, Begüm; Ercan, Gökhan; Yıldız, Olcay Taner
    Many sentences create certain impressions on people. These impressions help the reader gain insight into the sentence via some entities. In NLP, this process corresponds to Named Entity Recognition (NER). NLP algorithms can trace many entities in a sentence, such as person, location, date, time or money. One of the major problems in these operations is confusion about whether a word denotes the name of a person, a location or an organisation, or whether an integer stands for a date, time or money. In this study, we design a new model for NER algorithms. We train this model on our predefined dataset and compare the results with other models. In the end, we obtain considerable results on a dataset containing 1400 sentences.
  • Publication
    Shallow parsing in Turkish
    (IEEE, 2017) Topsakal, Ozan; Açıkgöz, Onur; Gürkan, Ali Tunca; Kanburoğlu, Ali Buğra; Ertopçu, Burak; Özenç, Berke; Çam, İlker; Avar, Begüm; Ercan, Gökhan; Yıldız, Olcay Taner
    In this study, shallow parsing is applied to Turkish sentences. These sentences are used to train and test the performance of various learning algorithms with various features specified for shallow parsing in Turkish.
  • Publication
    All-words word sense disambiguation for Turkish
    (IEEE, 2017) Açıkgöz, Onur; Gürkan, Ali Tunca; Ertopçu, Burak; Topsakal, Ozan; Özenç, Berke; Kanburoğlu, Ali Buğra; Çam, İlker; Avar, Begüm; Ercan, Gökhan; Yıldız, Olcay Taner
    Identifying the sense of a word within a context is a challenging problem and has many applications in natural language processing. This assignment problem is called word sense disambiguation (WSD). Many papers in the literature focus on the English language and English data. Our dataset consists of 1400 sentences translated to Turkish from the Penn Treebank Corpus. This paper discusses 6 different feature extraction methods and their classification performance using C4.5, Random Forests, Rocchio, Naive Bayes, KNN, and linear and multilayer perceptrons. It examines how the described features perform on a morphologically rich language (Turkish) with several classifiers.
  • Publication
    A multilayer annotated corpus for Turkish
    (IEEE, 2018-06-06) Yıldız, Olcay Taner; Ak, Koray; Ercan, Gökhan; Topsakal, Ozan; Asmazoğlu, Cengiz
    In this paper, we present the first multilayer annotated corpus for Turkish, which is a low-resourced agglutinative language. Our dataset consists of 9,600 sentences translated from the Penn Treebank Corpus. Annotated layers contain syntactic and semantic information including morphological disambiguation of words, named entity annotation, shallow parse, sense annotation, and semantic role label annotation.
  • Publication
    Türkçe anlamsal söylem ve cümle benzerliği analizleri için veri kümesi oluşturma yöntemi [A dataset creation method for Turkish semantic discourse and sentence similarity analyses]
    (IEEE, 2018-12-06) Ercan, Gökhan; Erkek, Orçun; Açıkgöz, Onur; Özçelik, Rıza; Parlar, Selen; Yıldız, Olcay Taner
    The aim of our study is to prepare a dataset for Turkish that supports semantic discourse analysis at the paragraph-sentence level and textual similarity measurement at the paragraph-sentence and sentence-sentence levels.
  • Publication
    Grammar or crammer? The role of morphology in distinguishing orthographically similar but semantically unrelated words
    (Institute of Electrical and Electronics Engineers Inc., 2025) Ercan, Gökhan; Yıldız, Olcay Taner
    We show that n-gram-based distributional models fail to distinguish unrelated words due to the noise in semantic spaces. This issue remains hidden in conventional benchmarks but becomes more pronounced when orthographic similarity is high. To highlight this problem, we introduce OSimUnr, a dataset of nearly one million English and Turkish word-pairs that are orthographically similar but semantically unrelated (e.g., grammar - crammer). These pairs are generated through a graph-based WordNet approach and morphological resources. We define two evaluation tasks - unrelatedness identification and relatedness classification - to test semantic models. Our experiments reveal that FastText, with default n-gram segmentation, performs poorly (below 5% accuracy) in identifying unrelated words. However, morphological segmentation overcomes this issue, boosting accuracy to 68% (English) and 71% (Turkish) without compromising performance on standard benchmarks (RareWords, MTurk771, MEN, AnlamVer). Furthermore, our results suggest that even state-of-the-art LLMs, including Llama 3.3 and GPT-4o-mini, may exhibit noise in their semantic spaces, particularly in highly synthetic languages such as Turkish. To ensure dataset quality, we leverage WordNet, MorphoLex, and NLTK, covering fully derivational morphology supporting atomic roots (e.g., '-co_here+ance+y' for 'coherency'), with 405 affixes in Turkish and 467 in English.
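
The last abstract's point about orthographic similarity can be illustrated with a small sketch. It is not the authors' method or the OSimUnr pipeline. It only lists the character n-grams that the example pair "grammar" and "crammer" would share under FastText-style segmentation, assuming the FastText default n-gram lengths of 3 to 6 and the `<` and `>` word-boundary markers. The function names `char_ngrams` and `jaccard` are hypothetical, and the hashing of n-grams into buckets that FastText performs is ignored here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word wrapped in
    FastText-style boundary markers (assumed lengths: 3 to 6)."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }


def jaccard(a, b):
    """Jaccard overlap between two sets of n-grams."""
    return len(a & b) / len(a | b)


# Example pair taken from the abstract above.
first = char_ngrams("grammar")
second = char_ngrams("crammer")

shared = first & second          # subword units both words contribute
overlap = jaccard(first, second)

print(sorted(shared))
print(f"Jaccard overlap: {overlap:.3f}")
```

Any n-gram that appears in both sets contributes the same subword vector to both word representations, which is one plausible way for semantically unrelated words to end up close in an n-gram-based space. A morphological segmentation would instead split each word by its own root and affixes.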