Comparative Analysis of Indexing Techniques for Table Search in Data Lakes

Ibraheem Taha, Matteo Lissandrini, Alkis Simitsis, Yannis Ioannidis

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

Data lakes store vast numbers of datasets in various forms, collected from various sources. In this context, efficient table search is essential for identifying and integrating data to support business intelligence and machine learning pipelines. This paper explores effective methods for finding related tables using advanced table representation learning. Representation learning generates dense vector representations for tables at different levels (row, column, cell), enabling the use of advanced indexing techniques such as LSH, HNSW, and DiskANN, which speed up the core operation of approximate k-NN search within vector spaces. However, while several indexing techniques have been proposed so far, a thorough study and comparison of their effectiveness-versus-performance trade-offs is still missing. In this paper, we aim to shed light on this gap. We begin by reviewing advanced vector-search techniques for table search in data lakes, followed by a detailed analysis of k-ANN indexes. Next, we compare the HNSW and DiskANN indexing techniques in terms of their internal structure, effectiveness, efficiency, and scalability. Additionally, we explore the impact of model accuracy on index performance. Our experiments include four datasets of varying size and complexity. This study allows us to explore indexing design options, revealing the strengths and weaknesses of each, and to identify promising directions for future research.

Original language: English
Journal: International Journal of Semantic Computing
Volume: 19
Issue number: 2
Pages (from-to): 1-24
Number of pages: 24
ISSN: 1793-7108
Publication status: Published - 14 May 2025

Keywords

  • Data Discovery
  • Data Exploration
  • Data Lakes
  • Table Search
