Abstract
Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
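The kinds of filters the abstract mentions (dropping one-line stub articles and removing duplicates) could look roughly like the following minimal sketch. The input format (an iterable of `(title, text)` pairs), the two-line stub threshold, and the exact-hash deduplication are all assumptions for illustration, not the paper's actual pipeline; real pipelines often use near-duplicate methods such as MinHash instead of exact hashing.

```python
import hashlib


def filter_wikipedia_articles(articles):
    """Illustrative quality filter: drop one-line stubs and exact duplicates.

    `articles` is assumed to be an iterable of (title, text) pairs.
    This is a sketch of the general technique, not the paper's method.
    """
    seen_hashes = set()
    kept = []
    for title, text in articles:
        text = text.strip()
        # Heuristic: treat articles with fewer than two non-empty lines
        # as one-line stubs and discard them.
        if len([ln for ln in text.splitlines() if ln.strip()]) < 2:
            continue
        # Exact-duplicate detection via a content hash; a near-duplicate
        # method (e.g. MinHash) would catch lightly edited copies too.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append((title, text))
    return kept
```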
| Original language | English |
| ---|---|
| Number of pages | 20 |
| DOI | |
| Status | Submitted - Nov 2024 |
Projects
- 1 Ongoing
-
Multilingual Modelling for Resource-Poor Languages
Bjerva, J. (PI (principal investigator)), Lent, H. C. (Project participant), Chen, Y. (Project participant), Ploeger, E. (Project participant), Fekete, M. R. (Project participant) & Lavrinovics, E. (Project participant)
01/09/2022 → 31/08/2025
Projects: Project › Research