Grammar or crammer? The role of morphology in distinguishing orthographically similar but semantically unrelated words

Ercan, Gökhan; Yıldız, Olcay Taner

Files

Grammar_or_Crammer_The_Role_of_Morphology_in_Distinguishing_Orthographically_Similar_but_Semantically_Unrelated_Words_kopyası.pdf (11.44 MB)

Date

2025

Authors

Ercan, Gökhan

Yıldız, Olcay Taner

Publisher

Institute of Electrical and Electronics Engineers Inc.

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

We show that n-gram-based distributional models fail to distinguish unrelated words due to the noise in semantic spaces. This issue remains hidden in conventional benchmarks but becomes more pronounced when orthographic similarity is high. To highlight this problem, we introduce OSimUnr, a dataset of nearly one million English and Turkish word-pairs that are orthographically similar but semantically unrelated (e.g., grammar - crammer). These pairs are generated through a graph-based WordNet approach and morphological resources. We define two evaluation tasks - unrelatedness identification and relatedness classification - to test semantic models. Our experiments reveal that FastText, with default n-gram segmentation, performs poorly (below 5% accuracy) in identifying unrelated words. However, morphological segmentation overcomes this issue, boosting accuracy to 68% (English) and 71% (Turkish) without compromising performance on standard benchmarks (RareWords, MTurk771, MEN, AnlamVer). Furthermore, our results suggest that even state-of-the-art LLMs, including Llama 3.3 and GPT-4o-mini, may exhibit noise in their semantic spaces, particularly in highly synthetic languages such as Turkish. To ensure dataset quality, we leverage WordNet, MorphoLex, and NLTK, covering fully derivational morphology supporting atomic roots (e.g., '-co_here+ance+y' for 'coherency'), with 405 affixes in Turkish and 467 in English.
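The contrast between n-gram and morphological segmentation in the abstract can be illustrated with a minimal sketch. It assumes FastText-style character n-grams (word wrapped in `<` and `>`, lengths 3 to 6, matching the library defaults) and omits the whole-word token that FastText also stores. The morpheme lists in `HYPOTHETICAL_MORPHEMES` are illustrative assumptions, not segmentations taken from the paper or from OSimUnr, and the function names are invented for this example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word wrapped in < and >."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }


def jaccard(a, b):
    """Jaccard overlap between two collections of subword units."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical morphological segmentations, for illustration only.
HYPOTHETICAL_MORPHEMES = {
    "grammar": ["grammar"],
    "crammer": ["cram", "er"],
}

# Subword units shared under n-gram segmentation.
shared = char_ngrams("grammar") & char_ngrams("crammer")
ngram_overlap = jaccard(char_ngrams("grammar"), char_ngrams("crammer"))

# Subword units shared under the assumed morphological segmentation.
morph_overlap = jaccard(
    HYPOTHETICAL_MORPHEMES["grammar"], HYPOTHETICAL_MORPHEMES["crammer"]
)

print(sorted(shared))   # ['amm', 'ram', 'ramm']
print(ngram_overlap > morph_overlap)
```

Under n-gram segmentation the two unrelated words share three subword units, so their vectors are partly composed from the same parameters. Under the assumed morphological segmentation they share none, which is the intuition behind the accuracy gain the abstract reports.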

Keywords

Derivational morphology, Distributional semantic modeling, Language resource, Morphological segmentation, Orthographic similarity, Word-relatedness, Word-similarity, Economic and social effects, Semantic segmentation, Semantics, Distributional semantics, Semantic modelling, Turkishs, Modeling languages, Noise, Morphology, Grammar, Benchmark testing, Accuracy, Computational modeling, Training, Statistical analysis, Hands

Source

IEEE Access

WoS Quartile

Q2

Scopus Quartile

Q1

Volume

13

Citation

Ercan, G., & Yıldız, O. T. (2025). Grammar or crammer? The role of morphology in distinguishing orthographically similar but semantically unrelated words. IEEE Access, 13, 64412-64458. https://doi.org/10.1109/ACCESS.2025.3557086

Links

https://hdl.handle.net/11729/6675
https://doi.org/10.1109/ACCESS.2025.3557086

Collections

Student Publications Article Collection
Graduate Education Institute Other Publications Collection
Scopus Indexed Publications Collection
WoS Indexed Publications Collection
