TY - JOUR
T1 - Comparative Analysis of Indexing Techniques for Table Search in Data Lakes
AU - Taha, Ibraheem
AU - Lissandrini, Matteo
AU - Simitsis, Alkis
AU - Ioannidis, Yannis
PY - 2025/5/14
Y1 - 2025/5/14
N2 - Data lakes store vast numbers of datasets in various forms, collected from various sources. In this context, efficient table search is essential for identifying and integrating data to support business intelligence and machine learning pipelines. This paper explores effective methods for finding related tables using advanced table representation learning. Representation learning generates dense vector representations for tables at different levels (row, column, cell), enabling the use of advanced indexing techniques such as LSH, HNSW, and DiskANN, which speed up the core operation of approximate k-NN search within vector spaces. However, while several indexing techniques have been proposed so far, a thorough study and comparison of their effectiveness versus performance trade-offs is still missing. In this paper, we aim to shed light on this gap. We begin by reviewing advanced vector-search techniques for table search in data lakes, followed by a detailed analysis of k-ANN indexes. Next, we compare the HNSW and DiskANN indexing techniques in terms of their internal structure, effectiveness, efficiency, and scalability. Additionally, we explore the impact of model accuracy on index performance. Our experiments include four datasets of varying size and complexity. This study allows us to explore indexing design options, revealing the strengths and weaknesses of each, and to identify promising directions for future research.
AB - Data lakes store vast numbers of datasets in various forms, collected from various sources. In this context, efficient table search is essential for identifying and integrating data to support business intelligence and machine learning pipelines. This paper explores effective methods for finding related tables using advanced table representation learning. Representation learning generates dense vector representations for tables at different levels (row, column, cell), enabling the use of advanced indexing techniques such as LSH, HNSW, and DiskANN, which speed up the core operation of approximate k-NN search within vector spaces. However, while several indexing techniques have been proposed so far, a thorough study and comparison of their effectiveness versus performance trade-offs is still missing. In this paper, we aim to shed light on this gap. We begin by reviewing advanced vector-search techniques for table search in data lakes, followed by a detailed analysis of k-ANN indexes. Next, we compare the HNSW and DiskANN indexing techniques in terms of their internal structure, effectiveness, efficiency, and scalability. Additionally, we explore the impact of model accuracy on index performance. Our experiments include four datasets of varying size and complexity. This study allows us to explore indexing design options, revealing the strengths and weaknesses of each, and to identify promising directions for future research.
KW - Data Discovery
KW - Data Exploration
KW - Data Lakes
KW - Table Search
UR - http://www.scopus.com/inward/record.url?scp=105004918731&partnerID=8YFLogxK
U2 - 10.1142/S1793351X25420024
DO - 10.1142/S1793351X25420024
M3 - Journal article
SN - 1793-7108
VL - 19
SP - 1
EP - 24
JO - International Journal of Semantic Computing
JF - International Journal of Semantic Computing
IS - 2
ER -