ADC
2003
Springer

Document Classification via Structure Synopses

Information available on the Internet is frequently supplied simply as plain ASCII text, structured according to orthographic and semantic conventions. Traditional document classification is typically formulated as a learning problem in which each instance is a whole document represented by a feature vector. Such feature vectors are often generated from the occurrence and frequencies of words in the documents. The high dimensionality of these feature vectors causes problems: important clues might be missed, and the classification might be misled by trivial elements. In this paper, we propose a method that makes use of structuring conventions to reduce the size of the feature vector without affecting the accuracy of the classification process. Effectively, a synopsis of document structure is extracted, which contains only the most informative features; then a succinct feature vector is generated to represent the instance. Finally, a decision tree machine learning al...
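The contrast between a word-frequency vector and a compact structure-based vector can be illustrated with a minimal sketch. The abstract does not say which structural features the authors extract, so the five features below (`blank_line_ratio`, `header_like_ratio`, and so on), the function names, and the sample text are all hypothetical choices made for illustration only. The decision tree step is omitted.

```python
import re
from collections import Counter


def word_features(text):
    """Bag-of-words vector: one dimension per distinct lowercase word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def structure_synopsis(text):
    """Fixed-size vector of layout cues.

    These features are assumptions for illustration; they are not the
    feature set used in the paper, which the abstract does not specify.
    """
    lines = text.splitlines()
    nonblank = [ln for ln in lines if ln.strip()]
    n = max(len(nonblank), 1)  # avoid division by zero on empty input
    return {
        "blank_line_ratio": (len(lines) - len(nonblank)) / max(len(lines), 1),
        # Lines shaped like "Name: value", as in mail or news headers.
        "header_like_ratio": sum(
            bool(re.match(r"^[A-Za-z-]+:\s", ln)) for ln in nonblank
        ) / n,
        # Lines starting with "-", "*", "1." or "1)".
        "bullet_ratio": sum(
            bool(re.match(r"^\s*([-*]|\d+[.)])\s", ln)) for ln in nonblank
        ) / n,
        "indented_ratio": sum(ln.startswith((" ", "\t")) for ln in nonblank) / n,
        "mean_line_length": sum(len(ln) for ln in nonblank) / n,
    }


# Hypothetical plain-text document with header-like lines and a bullet list.
sample = (
    "From: alice@example.org\n"
    "Subject: Meeting agenda\n"
    "\n"
    "Items for Friday:\n"
    "  - budget review\n"
    "  - hiring plan\n"
)

print(len(word_features(sample)), "word dimensions")
print(len(structure_synopsis(sample)), "structural dimensions")
```

The word-based vector grows with the vocabulary, while the structural vector stays the same size for any document, which is the kind of reduction the abstract describes.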
Liping Ma, John Shepherd, Anh Nguyen
Added 23 Aug 2010
Updated 23 Aug 2010
Type Conference
Year 2003
Where ADC