TY - JOUR
T1 - Efficient Metric Indexing for Similarity Search and Similarity Joins
AU - Chen, Lu
AU - Gao, Yunjun
AU - Li, Xinhan
AU - Jensen, Christian S.
AU - Chen, Gang
PY - 2017/3/1
Y1 - 2017/3/1
N2 - Spatial queries including similarity search and similarity joins are useful in many areas, such as multimedia retrieval, data integration, and so on. However, they are not supported well by commercial DBMSs. This may be due to the complex data types involved and the needs for flexible similarity criteria seen in real applications. In this paper, we propose a versatile and efficient disk-based index for metric data, the S pace-filling curve and Pivot-based B+-tree (SPB-tree). This index leverages the B+-tree, and uses space-filling curve to cluster data into compact regions, thus achieving storage efficiency. It utilizes a small set of so-called pivots to reduce significantly the number of distance computations when using the index. Further, it makes use of a separate random access file to support a broad range of data. By design, it is easy to integrate the SPB-tree into an existing DBMS. We present efficient algorithms for processing similarity search and similarity joins, as well as corresponding cost models based on SPB-trees. Extensive experiments using both real and synthetic data show that, compared with state-of-the-art competitors, the SPB-tree has much lower construction cost, smaller storage size, and supports more efficient similarity search and similarity joins with high accuracy cost models.
AB - Spatial queries including similarity search and similarity joins are useful in many areas, such as multimedia retrieval, data integration, and so on. However, they are not supported well by commercial DBMSs. This may be due to the complex data types involved and the needs for flexible similarity criteria seen in real applications. In this paper, we propose a versatile and efficient disk-based index for metric data, the S pace-filling curve and Pivot-based B+-tree (SPB-tree). This index leverages the B+-tree, and uses space-filling curve to cluster data into compact regions, thus achieving storage efficiency. It utilizes a small set of so-called pivots to reduce significantly the number of distance computations when using the index. Further, it makes use of a separate random access file to support a broad range of data. By design, it is easy to integrate the SPB-tree into an existing DBMS. We present efficient algorithms for processing similarity search and similarity joins, as well as corresponding cost models based on SPB-trees. Extensive experiments using both real and synthetic data show that, compared with state-of-the-art competitors, the SPB-tree has much lower construction cost, smaller storage size, and supports more efficient similarity search and similarity joins with high accuracy cost models.
KW - cost model
KW - Indexing technique
KW - metric space
KW - query processing
UR - http://www.scopus.com/inward/record.url?scp=85012292964&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2015.2506556
DO - 10.1109/TKDE.2015.2506556
M3 - Journal article
AN - SCOPUS:85012292964
SN - 1041-4347
VL - 29
SP - 556
EP - 571
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 3
M1 - 7349200
ER -