A Study on Efficient Indexing for Table Search in Data Lakes

Ibraheem Taha, Matteo Lissandrini, Alkis Simitsis, Yannis Ioannidis

Publication: Contribution to book/anthology/report/conference proceeding › Conference article in proceeding › Research › Peer-reviewed

1 Citation (Scopus)

Abstract

Data lakes store large volumes of diverse datasets. One of the core challenges in data lakes is dataset discovery, which involves tasks such as finding related tables, domain discovery, and column clustering. In this paper, we focus on a popular approach for finding related tables in public or private data lakes, namely table search. Given the heterogeneity of the tables in a data lake, recent methods adopt table-representation learning and produce dense vector representations for every row, column, or even cell value. This enables advanced indexing techniques, such as HNSW, LSH, and DiskANN, which implement efficient data structures to speed up the core operation of approximate k-NN search in such vector spaces. However, while many indexing techniques have been employed so far, their practical value and effectiveness, governed by the tradeoff of accuracy vs. performance, have not been explored yet. In this paper, we aim to shed light on this gap. We start with an overview of state-of-the-art techniques for table search in data lakes that are based on vector-search operations. Then, we present an in-depth analysis of the performance of the k-ANN indexes and the techniques they adopt. This allows us to map, for the first time, the space of alternative implementations for these techniques when applied to data lakes, revealing the strengths and weaknesses of each option and further delineating exciting novel research directions.
Original language: English
Title: Proceedings - 18th IEEE International Conference on Semantic Computing, ICSC 2024
Number of pages: 8
Publisher: IEEE (Institute of Electrical and Electronics Engineers)
Publication date: 7 Feb 2024
Pages: 245-252
Article number: 10475618
ISBN (Print): 979-8-3503-8536-6
ISBN (Electronic): 979-8-3503-8535-9
DOI
Status: Published - 7 Feb 2024
Event: 2024 IEEE 18th International Conference on Semantic Computing (ICSC) - Laguna Hills, CA, USA
Duration: 5 Feb 2024 to 7 Feb 2024

Conference

Conference: 2024 IEEE 18th International Conference on Semantic Computing (ICSC)
Location: Laguna Hills, CA, USA
Period: 05/02/2024 to 07/02/2024
Name: IEEE International Conference on Semantic Computing (ICSC)
ISSN: 2325-6516
