Abstract
Current language models (LMs) mostly use subwords as input units, derived from statistical co-occurrences of characters. Separately, previous work has shown that modeling morphemes can improve the performance of Natural Language Processing (NLP) models. However, morphemes are difficult to obtain because most languages lack annotated data. In this work, we release a wide-coverage Danish morphological segmentation evaluation set. We evaluate a wide variety of unsupervised token segmenters and measure the downstream effect of using morphemes as input units for transformer-based LMs. Our results show that popular subword algorithms perform poorly on this task, scoring at most an $F_1$ of 44.8, compared to 58.6 for an unsupervised morphological segmenter (Morfessor). Furthermore, we confirm that the more closely our tokenizer resembles morphemes, the higher the performance of LMs.
Original language | English |
---|---|
Title of host publication | Proceedings of the 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies: NoDaLiDa/Baltic-HLT 2025 |
Publisher | Northern European Association for Language Technology |
Publication date | Mar 2025 |
Publication status | Published - Mar 2025 |
Title | MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling |

Projects
- Digital Twins for Abundant Feedback: Novel Feedback Paradigms via Explainable Multilingual Natural Language Processing
Bjerva, J. (PI), Lindsay, E. (PI) & Zhang, M. (Project Participant)
01/01/2024 → 31/12/2025
Project: Research