With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near duplicate records efficiently. In this paper, we ...
Chuan Xiao, Wei Wang 0011, Xuemin Lin, Jeffrey Xu ...
Abstract. The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size “sketch...
The success of the Semantic Web depends on the availability of Web pages annotated with metadata. Free form metadata or tags, as used in social bookmarking and folksonomies, have ...
Broder et al.’s [3] shingling algorithm and Charikar’s [4] random projection based approach are considered “state-of-theart” algorithms for finding near-duplicate web pag...
Most of the current algorithms for finding related pages are exclusively based on text corpora of the WWW or incorporate only authority or hub values of pages. In this paper, we ...
Paul-Alexandru Chirita, Daniel Olmedilla, Wolfgang...