The paths more taken: matching DOM trees to search logs for accurate webpage clustering

An unsupervised clustering of the webpages on a website is a primary requirement for most wrapper induction and automated data extraction methods. Since page content can vary drastically across pages of one cluster (e.g., all product pages on amazon.com), traditional clustering methods typically use some distance function between the DOM trees representing a pair of webpages. However, without knowing which portions of the DOM tree are “important,” such distance functions might discriminate between similar pages based on trivial features (e.g., differing number of reviews on two product pages), or club together distinct types of pages based on superficial features present in the DOM trees of both (e.g., matching footer/copyright), leading to poor clustering performance. We propose using search logs to automatically find paths in the DOM trees that mark out important portions of pages, e.g., the product title in a product page. Such paths are identified via a global analysis of ...
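
To make the abstract's argument concrete, the following is a minimal sketch of a path-based distance between DOM trees. It is not the authors' method: the Jaccard distance over root-to-element tag paths, the names `PathCollector`, `dom_paths`, and `path_distance`, the three toy pages, and the hand-picked `important` path set are all assumptions made for illustration. In the paper the important paths are derived from search logs; that step is not shown in this excerpt and is not reproduced here.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the root-to-element tag path of every element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tags currently open, outermost first
        self.paths = set()   # e.g. {"html", "html/body", "html/body/h1"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; ignore stray closes.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_distance(html_a, html_b, important=None):
    """Jaccard distance over tag paths, optionally restricted to `important`."""
    a, b = dom_paths(html_a), dom_paths(html_b)
    if important is not None:
        a, b = a & important, b & important
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy pages: two product pages and one listing page.
FOOTER = "<footer><p>copyright</p></footer>"
product_a = ("<html><body><h1>Widget</h1><div><p>review</p></div>"
             + FOOTER + "</body></html>")
product_b = "<html><body><h1>Gadget</h1>" + FOOTER + "</body></html>"
listing = ("<html><body><ul><li>Widget</li></ul>"
           + FOOTER + "</body></html>")

# Hand-picked stand-in for a path that marks the product title.
important = {"html/body/h1"}

unweighted = path_distance(product_a, product_b)
same_type = path_distance(product_a, product_b, important)
other_type = path_distance(product_a, listing, important)
```

Without the restriction, the review block pushes the two product pages apart while the shared footer pulls the listing page toward them. Restricting the comparison to the assumed title path leaves the two product pages at distance 0 and the product and listing pages at distance 1.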

Deepayan Chakrabarti, Rupesh R. Mehta

DOM Trees | Internet Technology | Pages | Product Page | WWW 2010

Post Info

Added	14 May 2010
Updated	14 May 2010
Type	Conference
Year	2010
Where	WWW
Authors	Deepayan Chakrabarti, Rupesh R. Mehta
