Abstract
Nearest neighbor (NN) search in high-dimensional spaces is
inherently computationally expensive due to the curse of dimensionality. As a well-known solution to approximate NN
search, locality-sensitive hashing (LSH) is able to answer
c-approximate NN (c-ANN) queries in sublinear time with
constant probability. Existing LSH methods focus mainly
on building hash bucket based indexing such that the candidate points can be retrieved quickly. However, existing
coarse-grained structures fail to offer accurate distance estimation for candidate points, which translates into additional
computational overhead when having to examine unnecessary points. This in turn reduces the performance of query
processing. In contrast, we propose a fast and accurate LSH
framework, called PM-LSH, that aims to compute the cANN query on large- scale, high-dimensional datasets. First,
we adopt a simple yet effective PM-tree to index the data
points. Second, we develop a tunable confidence interval
to achieve accurate distance estimation and guarantee high
result quality. Third, we propose an efficient algorithm on
top of the PM-tree to improve the performance of computing c-ANN queries. Extensive experiments with real-world
data offer evidence that PM-LSH is capable of outperforming existing proposals with respect to both efficiency and
accuracy.
inherently computationally expensive due to the curse of dimensionality. As a well-known solution to approximate NN
search, locality-sensitive hashing (LSH) is able to answer
c-approximate NN (c-ANN) queries in sublinear time with
constant probability. Existing LSH methods focus mainly
on building hash bucket based indexing such that the candidate points can be retrieved quickly. However, existing
coarse-grained structures fail to offer accurate distance estimation for candidate points, which translates into additional
computational overhead when having to examine unnecessary points. This in turn reduces the performance of query
processing. In contrast, we propose a fast and accurate LSH
framework, called PM-LSH, that aims to compute the cANN query on large- scale, high-dimensional datasets. First,
we adopt a simple yet effective PM-tree to index the data
points. Second, we develop a tunable confidence interval
to achieve accurate distance estimation and guarantee high
result quality. Third, we propose an efficient algorithm on
top of the PM-tree to improve the performance of computing c-ANN queries. Extensive experiments with real-world
data offer evidence that PM-LSH is capable of outperforming existing proposals with respect to both efficiency and
accuracy.
Originalsprog | Engelsk |
---|---|
Tidsskrift | Proceedings of the VLDB Endowment |
Vol/bind | 13 |
Udgave nummer | 5 |
Sider (fra-til) | 643-655 |
Antal sider | 13 |
ISSN | 2150-8097 |
DOI | |
Status | Udgivet - 2020 |