Set Similarity Join on Probabilistic Data

Set similarity join plays an important role in many real-world applications, such as data cleaning, near-duplicate detection, and data integration. In these applications, set data often contain noise and are thus uncertain and imprecise. In this paper, we model such probabilistic set data at two levels of uncertainty: the set level and the element level. Based on these models, we investigate the problem of probabilistic set similarity join (PS²J) over two probabilistic set databases under possible worlds semantics. To process the PS²J operator efficiently, we first reduce the problem by condensing the possible worlds, and then propose effective pruning techniques, including Jaccard distance pruning, probability upper bound pruning, and aggregate pruning, which filter out false alarms among probabilistic set pairs with the help of indexes and our designed synopses. Through extensive experiments, we demonstrate the PS²J processing performance on both real and synthetic data.
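To make the problem statement concrete, the following is a minimal brute-force sketch of a join under possible worlds semantics with element-level uncertainty. It is not the paper's method: it enumerates every possible world and uses none of the condensing, pruning, indexing, or synopsis techniques described above. The representation (a dict mapping each element to an independent existence probability), the function names, the distance threshold `gamma`, and the probability threshold `alpha` are all assumptions made for illustration.

```python
from itertools import product


def jaccard_distance(a, b):
    """Jaccard distance 1 - |a ∩ b| / |a ∪ b|; two empty sets have distance 0."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def possible_worlds(prob_set):
    """Yield (world, probability) pairs for one probabilistic set.

    Assumption: prob_set maps element -> existence probability, and
    elements appear independently of each other.
    """
    items = list(prob_set.items())
    for mask in product([False, True], repeat=len(items)):
        world = frozenset(e for (e, _), keep in zip(items, mask) if keep)
        prob = 1.0
        for (_, p), keep in zip(items, mask):
            prob *= p if keep else 1.0 - p
        yield world, prob


def similarity_probability(r, s, gamma):
    """Probability that the Jaccard distance between r and s is at most gamma."""
    total = 0.0
    for world_r, prob_r in possible_worlds(r):
        for world_s, prob_s in possible_worlds(s):
            if jaccard_distance(world_r, world_s) <= gamma:
                total += prob_r * prob_s
    return total


def naive_ps2_join(R, S, gamma, alpha):
    """Return index pairs (i, j) whose similarity probability is at least alpha.

    Exponential in set size; intended only to illustrate the semantics.
    """
    return [
        (i, j)
        for i, r in enumerate(R)
        for j, s in enumerate(S)
        if similarity_probability(r, s, gamma) >= alpha
    ]


# Hypothetical data: "b" is present in the first set with probability 0.5.
R = [{"a": 1.0, "b": 0.5}]
S = [{"a": 1.0}]
matches = naive_ps2_join(R, S, gamma=0.4, alpha=0.5)
```

The cost of this enumeration grows exponentially with the number of uncertain elements, which is the kind of blow-up that the condensed possible worlds and pruning techniques in the abstract are meant to avoid.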
Xiang Lian, Lei Chen 0002
Added 30 Jan 2011
Updated 30 Jan 2011
Type Journal
Year 2010