Towards practical genre classification of web documents

Classification of documents by genre is typically done using either linguistic analysis or term-frequency-based techniques. The former provides better classification accuracy than the latter, but at the cost of two orders of magnitude more computation time. While term frequency analysis requires far fewer computational resources than linguistic analysis, it returns poor classification accuracy when the genres are not sufficiently distinct. A method that removes or approximates the expensive portions of linguistic analysis is presented. The accuracy and computation time of this method are then compared with those of both linguistic analysis and term frequency analysis. The results in this paper show that this method can significantly reduce the computation time of both linguistic analysis and term frequency analysis, while retaining an accuracy that is higher than that of term frequency analysis.
Categories and Subject Descriptors I.2.7 [Artificial Intelligence]: Natural Language Processing--T...
George Ferizis, Peter Bailey
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2006
Where WWW
Authors George Ferizis, Peter Bailey