What is "Typological Diversity" in NLP?

Esther Ploeger, Wessel Poelman, Miryam de Lhoneux, Johannes Bjerva

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

1 Citation (Scopus)
36 Downloads (Pure)

Abstract

The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world’s languages. An increasing number of papers aspires to enhance generalizable multilingual performance across languages. To this end, linguistic typology is commonly used to motivate language selection, on the basis that a broad typological sample ought to imply generalization across a broad range of languages. These selections are often described as being typologically diverse. In this meta-analysis, we systematically investigate NLP research that includes claims regarding typological diversity. We find there are no set definitions or criteria for such claims. We introduce metrics to approximate the diversity of resulting language samples along several axes and find that the results vary considerably across papers. Crucially, we show that skewed language selection can lead to overestimated multilingual performance. We recommend future work to include an operationalization of typological diversity that empirically justifies the diversity of language samples. To help facilitate this, we release the code for our diversity measures.
Original language: English
Title of host publication: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: EMNLP
Number of pages: 20
Publisher: Association for Computational Linguistics
Publication date: Nov 2024
Pages: 5681-5700
ISBN (Electronic): 979-8-89176-164-3
Publication status: Published - Nov 2024
Event: The 2024 Conference on Empirical Methods in Natural Language Processing, Miami, United States
Duration: 12 Nov 2024 to 16 Nov 2024
https://2024.emnlp.org/

Conference

Conference: The 2024 Conference on Empirical Methods in Natural Language Processing
Country/Territory: United States
City: Miami
Period: 12/11/2024 to 16/11/2024

Keywords

  • Multilingual NLP
  • Typology
  • NLP
  • Language Models
  • Diversity

Related research output
  • A Call for Consistency in Reporting Typological Diversity

    Poelman, W., Ploeger, E., de Lhoneux, M. & Bjerva, J., 17 Mar 2024, SIGTYP 2024 - 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Proceedings of the Workshop. Hahn, M., Sorokin, A., Kumar, R., Shcherbakov, A., Otmakhova, Y., Yang, J., Serikov, O., Rani, P., Ponti, E. M., Muradoglu, S., Gao, R., Cotterell, R. & Vylomova, E. (eds.). Association for Computational Linguistics, p. 75-77 3 p.

    Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

    Open Access
