Detecting Near-replicas on the Web by Content and Hyperlink Analysis

Replicas and near-replicas of documents are very common on the Web. Documents may be replicated completely or partially for different reasons (versions, mirrors, etc.), or the same resource may be associated with different URLs (aliases, dynamically generated pages, etc.). Whilst replication can improve the accessibility of information for users, near-replicated documents can hinder the effectiveness of search engines. For example, users would be annoyed by many similar pages appearing in the result list returned for a query. We propose a method to detect similar pages, in particular replicas and near-replicas, which is based on a pair of signatures. Both signatures are low-dimensional vectors, which reduces the computational cost of comparing pairs of documents. The first signature is obtained by a random projection of the bag-of-words vector representing the page contents. The second signature, referred to as Hyperlink Map, is...
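As a rough illustration of the first signature only, the sketch below projects a bag-of-words vector onto a small number of random directions and compares two pages by the cosine of their signatures. It is not the authors' implementation: the function names, the signature size of 64, the use of a hash-seeded Gaussian direction per term, and the cosine comparison are all assumptions made for this example. The abstract does not specify any of these details, and the Hyperlink Map is not covered because its description is cut off above.

```python
import hashlib
import math
import random
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated terms (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def term_direction(term, dim):
    """Hypothetical helper: a fixed pseudo-random Gaussian vector for one term.

    Seeding from a hash of the term means every page uses the same
    projection row for the same word, without storing a full matrix.
    """
    seed = int.from_bytes(hashlib.sha256(term.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def content_signature(text, dim=64):
    """Random projection of the bag-of-words vector into `dim` dimensions."""
    signature = [0.0] * dim
    for term, count in bag_of_words(text).items():
        row = term_direction(term, dim)
        for i in range(dim):
            signature[i] += count * row[i]
    return signature


def cosine(a, b):
    """Cosine similarity between two signatures (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


page = ("the presence of replicas or near replicas of documents is very "
        "common on the web and can hinder search engines")
near_replica = page.replace("hinder", "hurt")
unrelated = "random projections map high dimensional vectors into a much smaller space"

sig_page = content_signature(page)
print(cosine(sig_page, content_signature(near_replica)))
print(cosine(sig_page, content_signature(unrelated)))
```

Comparing two fixed-length signatures costs time proportional to the signature size rather than to the vocabulary size, which is the saving the abstract refers to. A real system would also need a decision threshold on the similarity score; choosing one is outside the scope of this sketch.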

Ernesto Di Iorio, Michelangelo Diligenti, Marco Gori, Marco Maggini, Augusto Pucci

Dataset Replicas | Internet Technology | Low Dimensional Vectors | Similar Pages | WWW 2003 |

» Web data mining exploring hyperlinks contents and usage data

» Enhanced web document summarization using hyperlinks

» Improved Algorithms for Topic Distillation in a Hyperlinked Environment

» Completing wikipedias hyperlink structure through dimensionality reduction

» The connectivity sonar detecting site functionality by structural patterns

» Extended Link Analysis for Extracting Spatial Information Hubs

» ViDIFF Understanding Web Pages Changes

» Link Spam Detection based on DBSpamClust with Fuzzy Cmeans Clustering

Post Info

Added	22 Nov 2009
Updated	22 Nov 2009
Type	Conference
Year	2003
Where	WWW
Authors	Ernesto Di Iorio, Michelangelo Diligenti, Marco Gori, Marco Maggini, Augusto Pucci