MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling

Rob van der Goot, Anette Jensen, Emil Allerslev Schledermann, Mikkel Wildner Kildeberg, Nicolaj Larsen, Mike Zhang, Elisa Bassignana

Publication type: Contribution to book/anthology/report/conference proceedings; conference article in proceedings; research; peer-reviewed

Abstract

Current language models (LMs) mostly exploit subwords as input units based on statistical co-occurrences of characters. Separately, previous work has shown that modeling morphemes can aid performance for Natural Language Processing (NLP) models. However, morphemes are challenging to obtain, as there is no annotated data for most languages. In this work, we release a wide-coverage Danish morphological segmentation evaluation set. We compare a wide variety of unsupervised token segmenters and evaluate the downstream effect of using morphemes as input units for transformer-based LMs. Our results show that popular subword algorithms perform poorly on this task, scoring at most an $F_1$ of 44.8 compared to 58.6 for an unsupervised morphological segmenter (Morfessor). Furthermore, we confirm that the more closely our tokenizer resembles morphemes, the higher the performance of LMs.
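
The abstract reports segmentation quality as an $F_1$ score but does not say how it is computed. As a rough illustration only, the sketch below assumes one common choice, micro-averaged $F_1$ over morpheme boundary positions; the paper's actual metric may differ. The function names and the two Danish example words are hypothetical and are not taken from the released evaluation set.

```python
def boundaries(segments):
    """Return the character offsets of the internal boundaries of one segmented word."""
    positions = set()
    offset = 0
    for morph in segments[:-1]:  # no boundary after the final segment
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_f1(gold, predicted):
    """Micro-averaged boundary precision, recall and F1 over parallel lists of segmentations.

    Assumption: F1 is measured on boundary positions, which may not match the paper's metric.
    """
    tp = fp = fn = 0
    for gold_segs, pred_segs in zip(gold, predicted):
        if "".join(gold_segs) != "".join(pred_segs):
            raise ValueError("gold and predicted segmentations must spell the same word")
        gold_b, pred_b = boundaries(gold_segs), boundaries(pred_segs)
        tp += len(gold_b & pred_b)   # boundaries found in both
        fp += len(pred_b - gold_b)   # predicted but not in gold
        fn += len(gold_b - pred_b)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative (hypothetical) gold and predicted segmentations.
gold = [["hus", "e", "ne"], ["bil", "er", "ne"]]
predicted = [["hus", "ene"], ["bi", "ler", "ne"]]
precision, recall, f1 = boundary_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy example the predicted segmentations recover two of the four gold boundaries and propose one spurious boundary, so precision is 2/3, recall is 1/2, and $F_1$ is 4/7.
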
Original language: English
Title: Proceedings of the 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies: NoDaLiDa/Baltic-HLT 2025
Publisher: Northern European Association for Language Technology
Publication date: Mar. 2025
Status: Published - Mar. 2025
