A Study on Efficient Indexing for Table Search in Data Lakes

Ibraheem Taha, Matteo Lissandrini, Alkis Simitsis, Yannis Ioannidis

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

2 Citations (Scopus)
5 Downloads (Pure)

Abstract

Data lakes store diverse and large volumes of datasets. One of the core challenges in data lakes is dataset discovery, which involves tasks such as finding related tables, domain discovery, and column clustering. In this paper, we focus on a popular approach for finding related tables in public or private data lakes, namely table search. Given the heterogeneity of the tables in a data lake, recent methods adopt table-representation learning and produce dense vector representations for every row, column, or even cell value. This enables advanced indexing techniques, such as HNSW, LSH, and DiskANN, which implement efficient data structures to speed up the core operation of approximate k-NN search in such vector spaces. However, while many indexing techniques have been employed so far, their practical value and effectiveness, governed by the tradeoff of accuracy vs. performance, have not been explored yet. In this paper, we aim to shed light on this gap. We start with an overview of state-of-the-art techniques for table search in data lakes that are based on vector-search operations. Then, we present an in-depth analysis of the performance of the k-ANN indexes and the techniques they adopt. This allows us to map, for the first time, the space of alternative implementations for these techniques when applied to data lakes, revealing the strengths and weaknesses of each option and further delineating exciting novel research directions.
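
To make the accuracy vs. performance tradeoff mentioned in the abstract concrete, the following is a minimal sketch, not the paper's method or benchmark. It compares an exact brute-force k-NN scan with a single-table random-hyperplane LSH index over synthetic vectors and reports recall@k. The class and function names (HyperplaneLSH, exact_knn, recall_at_k), the vector dimension, the number of hash bits, and the synthetic Gaussian data are all assumptions chosen for illustration; real table-search systems would use learned column or row embeddings and a production index.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def exact_knn(query, vectors, k):
    """Brute-force baseline: score every vector, return the k most similar ids."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]


class HyperplaneLSH:
    """Hypothetical single-table LSH index using random hyperplanes.

    Each vector is hashed to a tuple of sign bits; only vectors that share
    the query's bucket are scored, which is faster but may miss neighbours.
    """

    def __init__(self, dim, n_bits, rng):
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(list)
        self.vectors = []

    def _signature(self, v):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0
                     for plane in self.planes)

    def add(self, v):
        self.vectors.append(v)
        self.buckets[self._signature(v)].append(len(self.vectors) - 1)

    def query(self, q, k):
        # Score only the candidates in the query's bucket.
        candidates = self.buckets.get(self._signature(q), [])
        ranked = sorted(candidates,
                        key=lambda i: cosine(q, self.vectors[i]),
                        reverse=True)
        return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Synthetic stand-in for embedding vectors (assumption: 500 vectors, 16 dims).
rng = random.Random(0)
dim, n = 16, 500
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

index = HyperplaneLSH(dim, n_bits=6, rng=rng)
for v in vectors:
    index.add(v)

query = vectors[42]
exact = exact_knn(query, vectors, 10)
approx = index.query(query, 10)
print("candidates scored:", len(index.buckets[index._signature(query)]), "of", n)
print("recall@10:", recall_at_k(approx, exact))
```

In this sketch, raising n_bits shrinks each bucket, so fewer candidates are scored per query, but recall typically drops; lowering it does the opposite. Graph-based indexes such as HNSW expose analogous knobs, though their internals differ from what is shown here.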
Original language: English
Title of host publication: Proceedings - 18th IEEE International Conference on Semantic Computing, ICSC 2024
Number of pages: 8
Publisher: IEEE (Institute of Electrical and Electronics Engineers)
Publication date: 7 Feb 2024
Pages: 245-252
Article number: 10475618
ISBN (Print): 979-8-3503-8536-6
ISBN (Electronic): 979-8-3503-8535-9
Publication status: Published - 7 Feb 2024
Event: 2024 IEEE 18th International Conference on Semantic Computing (ICSC) - Laguna Hills, CA, USA
Duration: 5 Feb 2024 to 7 Feb 2024

Conference

Conference: 2024 IEEE 18th International Conference on Semantic Computing (ICSC)
Location: Laguna Hills, CA, USA
Period: 05/02/2024 to 07/02/2024
Series: IEEE International Conference on Semantic Computing (ICSC)
ISSN: 2325-6516

Keywords

  • Data Exploration
  • Data Discovery
  • Data Lakes
  • Table search
  • Indexing
