Where and How Duplicates Occur in the Web


In this paper we study duplicates on the Web, using collections that contain the documents of all sites under the .cl domain and that represent accurate and representative subsets of the Web. We identify duplicate and near-duplicate documents in our collections and study the distribution of documents across clusters of duplicates. We also study the occurrence of duplicates in both parts of our Web graphs (the connected and the disconnected components) to identify where duplicates occur more frequently. We show, for the first time, that the number of duplicates in the Web is significantly greater than the number of duplicates in the connected component of the Web graph. Previous works that estimated the number of duplicates in the Web used collections drawn from connected components of the Web; in those cases the sample of the Web was biased.
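To make the notion of "clusters of duplicates" concrete, here is a minimal Python sketch that groups documents whose normalized text is identical and reports the distribution of cluster sizes. It is an illustration only, not the authors' method: the function names, the SHA-1 content hash, and the whitespace and case normalization are all assumptions, and the sketch finds only exact duplicates after normalization, not near-duplicates.

```python
import hashlib
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(text.lower().split())


def duplicate_clusters(docs):
    """Group URLs whose normalized text hashes to the same digest.

    docs: dict mapping a URL (hypothetical identifier) to document text.
    Returns a sorted list of clusters, each a sorted list of URLs,
    keeping only clusters with at least two documents.
    """
    by_hash = defaultdict(list)
    for url, text in docs.items():
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        by_hash[digest].append(url)
    return sorted(sorted(urls) for urls in by_hash.values() if len(urls) > 1)


def cluster_size_distribution(clusters):
    """Map cluster size to the number of clusters of that size."""
    return Counter(len(cluster) for cluster in clusters)


def duplicate_fraction(docs, clusters):
    """Fraction of documents that are redundant copies.

    One document per cluster is treated as the original; the rest
    count as duplicates.
    """
    if not docs:
        return 0.0
    redundant = sum(len(cluster) - 1 for cluster in clusters)
    return redundant / len(docs)
```

Running the same functions once over a full collection and once over only the documents in the connected component of the Web graph would give the kind of comparison the abstract describes, under the assumptions stated above.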

Álvaro R. Pereira Jr., Ricardo A. Baeza-Yates, Nivio Ziviani

.cl Domain | Collections Containing Documents | Human Computer Interaction | Internet Technology | LAWEB 2006 | Web Graph |

Post Info

Added	12 Jun 2010
Updated	12 Jun 2010
Type	Conference
Year	2006
Where	LAWEB
Authors	Álvaro R. Pereira Jr., Ricardo A. Baeza-Yates, Nivio Ziviani
