Sciweavers

CIKM
2005
Springer

Fast webpage classification using URL features

We demonstrate the usefulness of the uniform resource locator (URL) alone in performing web page classification. This approach is orders of magnitude faster than typical web page classification, as the pages themselves do not have to be fetched and analyzed. Our approach segments the URL into meaningful chunks and adds component, sequential and orthographic features to model salient patterns. The resulting binary features are used in supervised maximum entropy modeling. We analyze our approach's effectiveness in binary, multi-class and hierarchical classification. Our results show that, in certain scenarios, URL-based methods approach and sometimes exceed the performance of full-text and link-based methods. We also use these features to predict the prestige of a webpage (as modeled by PageRank), and show that prestige can be predicted with an average error of less than one point (on a ten-point scale) in a topical set of web pages. Categories and Subject Descriptors H.3.1 [Information Storage a...
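As a rough illustration of the feature extraction the abstract describes, the sketch below splits a URL into tokens and emits binary features of the three kinds mentioned (component, sequential, orthographic). It is not the authors' implementation: the function names `segment_url` and `url_features`, the tokenization rule, and the feature naming scheme are all assumptions made for this example.

```python
import re
from urllib.parse import urlsplit


def segment_url(url):
    """Split a URL into lowercase letter-run and digit-run tokens, per component.

    Assumption: tokens are maximal runs of letters or of digits; the paper's
    actual segmentation may differ.
    """
    parts = urlsplit(url)
    components = {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }
    return {
        name: [tok.lower() for tok in re.findall(r"[A-Za-z]+|\d+", text)]
        for name, text in components.items()
    }


def url_features(url):
    """Return a set of binary feature names (hypothetical naming scheme)."""
    features = set()
    for component, tokens in segment_url(url).items():
        for tok in tokens:
            # Plain token feature, independent of position in the URL.
            features.add(f"tok={tok}")
            # Component feature: the same token tied to the URL part it came from.
            features.add(f"{component}:tok={tok}")
            # Orthographic feature: numeric tokens are generalized to their length.
            if tok.isdigit():
                features.add(f"{component}:numeric_len={len(tok)}")
        # Sequential feature: adjacent token pairs within one component.
        for left, right in zip(tokens, tokens[1:]):
            features.add(f"{component}:bigram={left}_{right}")
    return features


if __name__ == "__main__":
    for name in sorted(url_features("http://www.cs.cornell.edu/Info/Courses/Fall-99/cs401/")):
        print(name)
```

Each feature name in the returned set can be treated as a binary indicator (present or absent) and passed to any supervised classifier that accepts sparse binary inputs, such as a maximum entropy model. No page content is fetched at any point, which is where the speed advantage comes from.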
Min-Yen Kan, Hoang Oanh Nguyen Thi
Added 26 Jun 2010
Updated 26 Jun 2010
Type Conference
Year 2005
Where CIKM