Abstract
The strengths of subword tokenization have been widely demonstrated when applied to higher-resourced, morphologically simple languages.
However, it is not self-evident that these results transfer to lower-resourced, morphologically complex languages.
In this work, we investigate the influence of different subword segmentation techniques on machine translation between Danish and Kalaallisut, the official language of Greenland.
We present the first semi-manually aligned parallel corpus for this language pair, and use it to compare subwords from unsupervised tokenizers and morphological segmenters.
We find that Unigram-based segmentation both preserves morphological boundaries and handles out-of-vocabulary words adequately, but that this does not directly correspond to superior translation quality.
We hope that our findings lay further groundwork for future efforts in neural machine translation for Kalaallisut.
Original language | English |
---|---|
Title | Proceedings of the 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies: NoDaLiDa/Baltic-HLT 2025 |
Publication date | March 2025 |
Status | Accepted/In press - January 2025 |
Event | The 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies - Hestia Hotel Europa, Tallinn, Estonia. Duration: 2 March 2025 → 5 March 2025. https://www.nodalida-bhlt2025.eu/ |
Conference
Conference | The 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies |
---|---|
Location | Hestia Hotel Europa |
Country/Territory | Estonia |
City | Tallinn |
Period | 02/03/2025 → 05/03/2025 |
Website | https://www.nodalida-bhlt2025.eu/ |
Fingerprint
Dive into the research topics of 'Tokenization on Trial: The Case of Kalaallisut–Danish Legal Machine Translation'. Together they form a unique fingerprint.
Projects
- 1 Active
- Multilingual Modelling for Resource-Poor Languages
Bjerva, J. (PI (principal investigator)), Lent, H. C. (Project participant), Chen, Y. (Project participant), Ploeger, E. (Project participant), Fekete, M. R. (Project participant) & Lavrinovics, E. (Project participant)
01/09/2022 → 31/08/2025
Projects: Project › Research