A pattern tree-based approach to learning URL normalization rules


Duplicate URLs cause serious trouble across the whole pipeline of a search engine, from crawling and indexing to result serving. URL normalization transforms duplicate URLs into a canonical form using a set of rewrite rules. URL normalization has recently attracted significant attention because it is lightweight and can be flexibly integrated into both the online (e.g. crawling) and the offline (e.g. index compression) parts of a search engine. To handle a large number of websites, automatic approaches that learn rewrite rules for various kinds of duplicate URLs are highly desirable. In this paper, we rethink the problem of URL normalization from a global perspective and propose a pattern tree-based approach, which is remarkably different from existing approaches. Most current approaches learn rewrite rules by iteratively generalizing local duplicate pairs into more general forms; they inevitably suffer from noisy training data and are inefficient in practice. Given a training set of URLs p...
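To make the idea of a rewrite rule concrete, here is a minimal Python sketch of normalizing duplicate URLs to one canonical form. It is not the paper's pattern tree method: the rule is hand-written rather than learned, and the parameter names `sid` and `ref`, the `normalize` helper, and the `example.com` URLs are all assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rule: these query parameters are assumed not to affect page content.
IRRELEVANT_PARAMS = {"sid", "ref"}

def normalize(url: str) -> str:
    """Apply one hand-written rewrite rule to map a URL to a canonical form."""
    parts = urlsplit(url)
    # Drop parameters assumed irrelevant, then sort the rest for a stable order.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IRRELEVANT_PARAMS
    )
    # Lowercase scheme and host, keep the path, discard the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )

# Three assumed duplicates of the same page.
urls = [
    "http://Example.com/item?id=7&sid=abc123",
    "http://example.com/item?sid=zzz&id=7",
    "http://example.com/item?id=7&ref=home",
]
canonical = {normalize(u) for u in urls}
```

Under these assumptions all three URLs collapse to a single canonical string, which is the effect a learned rule set aims for at the scale of a whole site.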

Tao Lei, Rui Cai, Jiang-Ming Yang, Yan Ke, Xiaodong Fan, Lei Zhang


Duplicate URLs | Internet Technology | Pattern Tree | URL Normalization | WWW 2010


» Deduping URLs via rewrite rules

» Autonomous Discovery of Reliable Exception Rules

» Information Extraction as an Ontology Population Task and Its Application to Genic Interac...

» Learning Relational Grammars from Sequences of Actions

Post Info

Added	14 May 2010
Updated	14 May 2010
Type	Conference
Year	2010
Where	WWW
Authors	Tao Lei, Rui Cai, Jiang-Ming Yang, Yan Ke, Xiaodong Fan, Lei Zhang
