MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling

Rob van der Goot, Anette Jensen, Emil Allerslev Schledermann, Mikkel Wildner Kildeberg, Nicolaj Larsen, Mike Zhang, Elisa Bassignana

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

Abstract

Current language models (LMs) mostly exploit subwords as input units based on statistical co-occurrences of characters. Relatedly, previous work has shown that modeling morphemes can aid performance for Natural Language Processing (NLP) models. However, morphemes are challenging to obtain, as there is no annotated data for most languages. In this work, we release a wide-coverage Danish morphological segmentation evaluation set. We evaluate a wide variety of unsupervised token segmenters and measure the downstream effect of using morphemes as input units for transformer-based LMs. Our results show that popular subword algorithms perform poorly on this task, scoring at most an $F_1$ of 44.8 compared to 58.6 for an unsupervised morphological segmenter (Morfessor). Furthermore, we confirm that the more closely our tokenizer resembles morphemes, the higher the performance of LMs.
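
The abstract reports $F_1$ scores for segmenters but does not say how the score is computed. As a minimal sketch, assuming a boundary-level definition (a predicted split point counts as correct if it falls at the same character offset as a gold split point), the score could be computed as below. The function names and the two example words are illustrative assumptions, not taken from the paper or its evaluation set.

```python
def boundaries(morphemes):
    """Return the character offsets at which a word is split.

    ["bil", "er", "ne"] -> {3, 5}; an unsegmented word yields an empty set.
    """
    cuts, pos = set(), 0
    for morpheme in morphemes[:-1]:
        pos += len(morpheme)
        cuts.add(pos)
    return cuts


def boundary_f1(gold, pred):
    """Micro-averaged boundary F1 over parallel lists of segmented words.

    Assumption: this is one common way to score segmentation; the paper's
    exact metric may differ.
    """
    tp = fp = fn = 0
    for gold_word, pred_word in zip(gold, pred):
        g, p = boundaries(gold_word), boundaries(pred_word)
        tp += len(g & p)   # split points found in both
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: "huset" (the house) and "bilerne" (the cars).
gold = [["hus", "et"], ["bil", "er", "ne"]]
pred = [["hus", "et"], ["biler", "ne"]]
score = boundary_f1(gold, pred)
```

In this toy case the prediction finds two of the three gold split points and proposes no wrong ones, so precision is 1 and recall is 2/3.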
Original language: English
Title of host publication: Proceedings of the 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies: NoDaLiDa/Baltic-HLT 2025
Publisher: Northern European Association for Language Technology
Publication date: Mar 2025
Publication status: Published - Mar 2025
