Adaptive Top-k Overlap Set Similarity Joins

Zhong Yang, Bolong Zheng, Guohui Li, Zhao Xi, Xiaofang Zhou, Christian S. Jensen

Research output: Contribution to book/anthology/report/conference proceedingArticle in proceedingResearchpeer-review

Abstract

The set similarity join (SSJ) is core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. Threshold-based SSJ queries return all pairs of sets with similarity no smaller than a given threshold. As results, and their utility, are very sensitive to the choice of threshold value, it is a problem that it is difficult to choose such an appropriate value. Doing so requires prior knowledge of the data, which users often do not have. To avoid this problem, we propose a solution to the top-k overlap set similarity join (TkOSSJ) that returns k pairs of sets with the highest overlap similarities. The state-of-the-art solution disregards the effect of the so-called step size, which is the number of elements accessed in each iteration of the algorithm. This affects its performance negatively. To address this issue, we first propose an algorithm that uses a fixed step size, thus taking advantage of the benefits of a large step size, and then we present an adaptive step size algorithm that is capable of automatically adjusting the step size, thus reducing redundant computations. An extensive empirical study offers insight into the new algorithms and indicates that they are capable of outperforming the state-of-the-art method on real, large-scale data sets.

Original languageEnglish
Title of host publication2020 IEEE 36th International Conference on Data Engineering
Number of pages12
PublisherIEEE
Publication date2020
Pages1081-1092
Article number9101864
ISBN (Print)978-1-7281-2904-4
ISBN (Electronic)9781728129037
DOIs
Publication statusPublished - 2020
Event36th IEEE International Conference on Data Engineering - Dallas, United States
Duration: 20 Apr 202024 Apr 2020
https://www.utdallas.edu/icde/

Conference

Conference36th IEEE International Conference on Data Engineering
CountryUnited States
CityDallas
Period20/04/202024/04/2020
Internet address
SeriesProceedings of the International Conference on Data Engineering
ISSN1063-6382

Keywords

  • Overlap set similarity
  • Similarity join
  • Top-k join, data lake

Fingerprint Dive into the research topics of 'Adaptive Top-k Overlap Set Similarity Joins'. Together they form a unique fingerprint.

Cite this