Sciweavers

ECIR
2003
Springer

Hierarchical Classification of HTML Documents with WebClassII

This paper describes a new method for classifying an HTML document into a hierarchy of categories. The hierarchy of categories is involved in all phases of automated document classification, namely feature extraction, learning, and classification of a new document. The innovative aspects of this work are the feature selection process, the automated determination of thresholds for classification scores, and an experimental study on real-world Web documents that can be associated with any node in the hierarchy. Moreover, a new measure for evaluating system performance is introduced in order to compare three different techniques (flat, hierarchical with proper training sets, hierarchical with hierarchical training sets). The method has been implemented in a client-server application named WebClassII. Results show that for hierarchical techniques it is better to use hierarchical training sets.
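The abstract gives no implementation details, so the following is only a minimal illustrative sketch of the general idea of top-down hierarchical classification with per-category score thresholds, where a document may stop at an internal node. It is not WebClassII's algorithm. The `Category` class, the `keyword_scorer` helper, the keyword lists, and the fixed thresholds are all hypothetical; the paper determines thresholds automatically and uses its own feature selection and learned classifiers.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Category:
    """A node in a hypothetical category hierarchy."""
    name: str
    score: Callable[[Set[str]], float]  # maps a document's terms to a score in [0, 1]
    threshold: float                    # minimum score needed to descend into this node
    children: List["Category"] = field(default_factory=list)


def keyword_scorer(keywords):
    """Toy scorer (an assumption): fraction of the category's keywords found in the document."""
    keywords = set(keywords)

    def score(terms: Set[str]) -> float:
        return len(terms & keywords) / len(keywords)

    return score


def classify(root: Category, terms: Set[str]) -> str:
    """Descend from the root to the best-scoring child at each level.

    The descent stops when no child reaches its threshold, so a document
    can be assigned to an internal node rather than only to a leaf.
    """
    node = root
    while node.children:
        best = max(node.children, key=lambda child: child.score(terms))
        if best.score(terms) < best.threshold:
            break
        node = best
    return node.name


# Hypothetical two-level hierarchy with hand-picked thresholds.
physics = Category("physics", keyword_scorer(["quantum", "particle"]), 0.5)
biology = Category("biology", keyword_scorer(["cell", "gene"]), 0.5)
science = Category("science", keyword_scorer(["research", "experiment", "theory"]), 0.5,
                   [physics, biology])
sports = Category("sports", keyword_scorer(["match", "team", "score"]), 0.5)
root = Category("root", lambda terms: 1.0, 0.0, [science, sports])

leaf_result = classify(root, {"research", "experiment", "quantum", "particle"})
internal_result = classify(root, {"research", "theory", "funding"})
root_result = classify(root, {"weather"})
```

In this toy setup the first document reaches the leaf `physics`, the second stops at the internal node `science` because neither child reaches its threshold, and the third stays at `root`.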
Michelangelo Ceci, Donato Malerba
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2003
Where ECIR
Authors Michelangelo Ceci, Donato Malerba