
HICSS
2009
IEEE

An N-Gram Based Approach to Automatically Identifying Web Page Genre

The research reported in this paper is the first phase of a larger project on the automatic classification of web pages by genre, using n-gram representations of the web pages. In this study, the textual content of web pages is used to create feature sets consisting of the most frequent n-grams and their associated frequencies. We present three methods, each of which uses a distance measure to determine the dissimilarity between two feature sets. Each method forms a feature set for every web page in the test set; however, the formation of feature sets from the training set differs between methods: we experiment with one feature set per web page, one feature set per genre, and a combination of genre-based feature sets supplemented by subgenre feature sets. We present results for a balanced corpus of seven genres (blog, eshop, FAQs, front page, listing, home page, and search page). Initial results are encouraging.
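The per-genre variant described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the n-gram type, the profile size, or the distance measure, so everything concrete here is an assumption made for illustration: character n-grams, the parameters `n` and `top_k`, a squared relative-difference distance, and the helper names `ngram_profile`, `profile_distance`, `build_genre_profiles`, and `classify`. It should not be read as the authors' implementation.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=500):
    """Feature set: the top_k most frequent character n-grams of `text`,
    mapped to their relative frequencies (assumed representation)."""
    text = " ".join(text.lower().split())  # lowercase, collapse whitespace
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in grams.most_common(top_k)}


def profile_distance(p, q):
    """Assumed dissimilarity: sum of squared relative frequency differences
    over the union of n-grams in both feature sets."""
    dist = 0.0
    for g in set(p) | set(q):
        fp, fq = p.get(g, 0.0), q.get(g, 0.0)
        # fp + fq > 0 because g occurs in at least one profile
        dist += ((fp - fq) / ((fp + fq) / 2.0)) ** 2
    return dist


def build_genre_profiles(training, n=3, top_k=500):
    """One feature set per genre, built from all training pages of that genre.
    `training` maps a genre label to a list of page texts."""
    return {
        genre: ngram_profile(" ".join(pages), n, top_k)
        for genre, pages in training.items()
    }


def classify(page_text, genre_profiles, n=3, top_k=500):
    """Assign the genre whose feature set is least dissimilar to the page's."""
    page = ngram_profile(page_text, n, top_k)
    return min(genre_profiles, key=lambda g: profile_distance(page, genre_profiles[g]))
```

The one-feature-set-per-page variant would follow the same pattern, with each training page keeping its own profile and the test page taking the genre label of its nearest training page.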
Jane E. Mason, Michael A. Shepherd, Jack Duffy
Added 19 May 2010
Updated 19 May 2010
Type Conference
Year 2009
Where HICSS
Authors Jane E. Mason, Michael A. Shepherd, Jack Duffy