Efficient Metric Indexing for Similarity Search and Similarity Joins

Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen

Research output: Contribution to journalJournal articleResearchpeer-review

29 Citations (Scopus)

Abstract

Spatial queries including similarity search and similarity joins are useful in many areas, such as multimedia retrieval, data integration, and so on. However, they are not supported well by commercial DBMSs. This may be due to the complex data types involved and the needs for flexible similarity criteria seen in real applications. In this paper, we propose a versatile and efficient disk-based index for metric data, the S pace-filling curve and Pivot-based B+-tree (SPB-tree). This index leverages the B+-tree, and uses space-filling curve to cluster data into compact regions, thus achieving storage efficiency. It utilizes a small set of so-called pivots to reduce significantly the number of distance computations when using the index. Further, it makes use of a separate random access file to support a broad range of data. By design, it is easy to integrate the SPB-tree into an existing DBMS. We present efficient algorithms for processing similarity search and similarity joins, as well as corresponding cost models based on SPB-trees. Extensive experiments using both real and synthetic data show that, compared with state-of-the-art competitors, the SPB-tree has much lower construction cost, smaller storage size, and supports more efficient similarity search and similarity joins with high accuracy cost models.

Original languageEnglish
Article number7349200
JournalIEEE Transactions on Knowledge and Data Engineering
Volume29
Issue number3
Pages (from-to)556-571
Number of pages16
ISSN1041-4347
DOIs
Publication statusPublished - 1 Mar 2017

Keywords

  • cost model
  • Indexing technique
  • metric space
  • query processing

Fingerprint

Dive into the research topics of 'Efficient Metric Indexing for Similarity Search and Similarity Joins'. Together they form a unique fingerprint.

Cite this