A Large Scale Test Corpus for Semantic Table Search

Aristotelis Leventidis, Martin Pekár Christensen, Matteo Lissandrini, Laura Di Rocco, Katja Hose, Renée J. Miller

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review


Abstract

Table search aims to answer a query with a ranked list of tables. Unfortunately, current test corpora have focused mostly on needle-in-the-haystack tasks, where only a few tables are expected to exactly match the query intent. Instead, table search tasks often arise in response to the need for retrieving new datasets or augmenting existing ones, e.g., for data augmentation within data science or machine learning pipelines. Existing table repositories and benchmarks are limited in their ability to test retrieval methods for such tasks. To close this gap, we introduce a novel dataset for query-by-example semantic table search. The dataset consists of two snapshots of the large-scale Wikipedia tables collection, from 2013 and 2019, with two important additions: (1) page- and topic-aware ground-truth relevance judgments and (2) a large-scale DBpedia entity-linking annotation. Moreover, we generate a new set of entity-centric queries that allows testing existing methods under a novel search scenario: semantic exploratory search. The resulting resource consists of 9,296 queries, 610,553 query-table relevance annotations, and 238,038 entity-linked tables from the 2013 snapshot; on the 2019 snapshot, it consists of 2,560 queries, 958,214 relevance annotations, and 457,714 tables in total. This makes our resource the largest annotated table-search corpus to date (97 times more queries and 956 times more annotated tables than any existing benchmark). A user study among domain experts confirms that the annotators agree with the automatically generated relevance annotations. As a result, we can re-evaluate some basic assumptions behind existing table search approaches, identifying their shortcomings along with promising novel research directions.
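
Because the corpus pairs each query with graded query-table relevance annotations, ranking quality can be scored with a standard graded-relevance metric such as NDCG. The sketch below is illustrative only: the query shape (a set of DBpedia entity URIs, matching the query-by-example setting), the judgment mapping, and every identifier in it are assumptions made for the example, not the corpus's actual file format.

    import math

    # Illustrative shapes (assumed, not the released file format):
    # a query-by-example query is a set of DBpedia entity URIs, and graded
    # relevance judgments map (query_id, table_id) pairs to integer grades.
    queries = {
        "q1": {"http://dbpedia.org/resource/Paris",
               "http://dbpedia.org/resource/Berlin"},
    }
    relevance = {
        ("q1", "table_0042"): 2,   # highly relevant to q1
        ("q1", "table_0817"): 1,   # somewhat relevant to q1
    }

    def dcg(grades):
        # Discounted cumulative gain: grade at rank r is discounted by log2(r + 2).
        return sum(g / math.log2(r + 2) for r, g in enumerate(grades))

    def ndcg_at_k(query_id, ranked_tables, k=10):
        # Score a system's ranked list of table ids against the gold judgments.
        gains = [relevance.get((query_id, t), 0) for t in ranked_tables[:k]]
        ideal = sorted((g for (q, _), g in relevance.items() if q == query_id),
                       reverse=True)[:k]
        return dcg(gains) / dcg(ideal) if ideal else 0.0

    # Example: a retrieval system returned these tables, best first.
    print(ndcg_at_k("q1", ["table_0817", "table_0042", "table_0003"]))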
Translated title of the contribution: En Stor-Skala Test Samling for Semantic Tabelsøgning (Danish)
Original language: English
Title of host publication: SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
Number of pages: 10
Place of publication: New York, USA
Publisher: Association for Computing Machinery (ACM)
Publication date: 10 Jul 2024
Edition: 46
Pages: 1142-1151
ISBN (Electronic): 9798400704314
DOIs
Publication status: Published - 10 Jul 2024
Event: SIGIR 2024: The 47th International ACM SIGIR Conference on Research and Development in Information Retrieval - Washington, United States
Duration: 14 Jul 2024 - 18 Jul 2024
https://dl.acm.org/doi/proceedings/10.1145/3626772

Conference

Conference: SIGIR 2024: The 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
Country/Territory: United States
City: Washington
Period: 14/07/2024 - 18/07/2024
Internet address: https://dl.acm.org/doi/proceedings/10.1145/3626772

Keywords

  • benchmark
  • query-by-example
  • semantic search
  • table search
