PM-LSH: a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search

Bolong Zheng*, Xi Zhao, Lianggui Weng, Quoc Viet Hung Nguyen, Hang Liu, Christian S. Jensen

*Corresponding author for this work

Research output: Contribution to journalConference article in JournalResearchpeer-review

6 Citations (Scopus)
25 Downloads (Pure)

Abstract

Nearest neighbor (NN) search is inherently computationally expensive in high-dimensional spaces due to the curse of dimensionality. As a well-known solution, locality-sensitive hashing (LSH) is able to answer c-approximate NN (c-ANN) queries in sublinear time with constant probability. Existing LSH methods focus mainly on building hash bucket-based indexing such that the candidate points can be retrieved quickly. However, existing coarse-grained structures fail to offer accurate distance estimation for candidate points, which translates into additional computational overhead when having to examine unnecessary points. This in turn reduces the performance of query processing. In contrast, we propose a fast and accurate in-memory LSH framework, called PM-LSH, that aims to compute the c-ANN query on large-scale, high-dimensional datasets. First, we adopt a simple yet effective PM-tree to index the data points. Second, we develop a tunable confidence interval to achieve accurate distance estimation and guarantee high result quality. Third, we propose an efficient algorithm on top of the PM-tree to improve the performance of computing c-ANN queries. In addition, we extend PM-LSH to support closest pair (CP) search in high-dimensional spaces. Here, we again adopt the PM-tree to organize the points in a low-dimensional space, and we propose a branch and bound algorithm together with a radius pruning technique to improve the performance of computing c-approximate closest pair (c-ACP) queries. Extensive experiments with real-world data offer evidence that PM-LSH is capable of outperforming existing proposals with respect to both efficiency and accuracy for both NN and CP search.

Original languageEnglish
JournalVLDB Journal
Volume31
Issue number6
Pages (from-to)1339-1363
Number of pages25
ISSN1066-8888
DOIs
Publication statusPublished - Nov 2022
Event16th International Workshop on Data Management on New Hardware - Virtual
Duration: 15 Jun 202015 Jun 2020
https://sites.google.com/view/damon2020/

Workshop

Workshop16th International Workshop on Data Management on New Hardware
LocationVirtual
Period15/06/202015/06/2020
Internet address

Bibliographical note

Funding Information:
This research is supported in part by the NSFC (Grants No. 61902134, 62011530437), the Hubei Natural Science Foundation (Grant No. 2020CFB871), and the Fundamental Research Funds for the Central Universities (HUST: Grants No. 2019kfyXKJC021, 2019kfyXJJS091).

Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.

Keywords

  • Approximate nearest neighbor
  • Closest pair
  • High-dimensional data
  • LSH

Fingerprint

Dive into the research topics of 'PM-LSH: a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search'. Together they form a unique fingerprint.

Cite this