Can Humans Identify Domains?

Maria Barrett, Max Müller-Eberstein, Elisa Bassignana, Amalie Brogaard Pauli, Mike Zhang, Rob van der Goot

Research output: Contribution to book/anthology/report/conference proceeding; Article in proceeding; Research; peer-review

Abstract

Textual domain is a crucial property within the Natural Language Processing (NLP) community due to its effects on downstream model performance. The concept itself is, however, loosely defined and, in practice, refers to any non-typological property, such as genre, topic, medium or style of a document. We investigate the core notion of domains via the human proficiency in identifying related intrinsic textual properties, specifically the concepts of genre (communicative purpose) and topic (subject matter), and publish our annotations in TGeGUM: A collection of 9.1k sentences from the GUM dataset (Zeldes, 2017) with single sentence and larger context (i.e., prose) annotations for one of 11 genres (source type), and its topic/subtopic as per the Dewey Decimal library classification system (Dewey, 1979), consisting of 10/100 hierarchical topics of increased granularity. Each instance is annotated by three annotators, for a total of 32.7k annotations, allowing us to examine the level of human disagreement and the relative difficulty of each annotation task. With a Fleiss’ kappa of at most 0.53 on the sentence level and 0.66 at the prose level, it is evident that despite the ubiquity of domains in NLP, there is actually little human consensus on how to define them. By training classifiers to perform the same task, we find that this uncertainty also extends to NLP models.
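
As a rough illustration of the agreement statistic reported above, the following minimal Python sketch computes Fleiss' kappa for items that each receive the same number of labels (three per instance in TGeGUM). It is not the authors' code: the function name `fleiss_kappa` and the toy genre labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels.

    Assumes every item was labeled by the same number of raters
    (at least two). The input layout is an assumption made for
    this sketch.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of labels")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_expected = sum(
        (n / (n_items * n_raters)) ** 2 for n in totals.values()
    )

    if p_expected == 1.0:
        # Only one category was ever used; kappa is undefined here.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: four sentences, three annotators each.
toy = [
    ["news", "news", "fiction"],
    ["fiction", "fiction", "fiction"],
    ["news", "news", "news"],
    ["news", "fiction", "fiction"],
]
print(round(fleiss_kappa(toy), 3))
```

On this toy input two of the four items are fully agreed on, and the score lands well below 1, which shows how partial disagreement among three annotators pulls kappa down even when a majority label exists for every item.
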
Original language: English
Title of host publication: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Number of pages: 21
Publisher: European Language Resources Association
Publication date: May 2024
Pages: 2745–2765
Publication status: Published - May 2024
Event: LREC-COLING 2024 - Lingotto Conference Centre, Torino, Italy
Duration: 20 May 2024 – 25 May 2024
https://lrec-coling-2024.org/

Conference

Conference: LREC-COLING 2024
Location: Lingotto Conference Centre
Country/Territory: Italy
City: Torino
Period: 20/05/2024 – 25/05/2024
Internet address: https://lrec-coling-2024.org/
Series: Proceedings of International Conference on Computational Linguistics (COLING)
ISSN: 2951-2093

Keywords

  • domain
  • genre
  • topic
  • multi-annotation
