Document Classification via Structure Synopses

Information available on the Internet is frequently supplied simply as plain ASCII text, structured according to orthographic and semantic conventions. Traditional document classification is typically formulated as a learning problem in which each instance is a whole document represented by a feature vector. Such feature vectors are often generated from the occurrence and frequencies of words in the documents. The high dimensionality of these feature vectors causes problems: important clues may be missed, and the classification may be misled by trivial elements. In this paper, we propose a method that uses structuring conventions to reduce the size of the feature vector without affecting the accuracy of the classification process. Effectively, a synopsis of document structure is extracted, which contains only the most informative features; then a succinct feature vector is generated to represent the instance. Finally, a decision tree machine learning al...
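As a rough illustration of what a structure-based feature vector could look like, the sketch below summarises the layout of a plain-text document in five numbers instead of one dimension per word. The function name structure_synopsis and the specific features (blank-line ratio, indentation, bullets, "Field: value" header lines, mean line length) are assumptions chosen for this example; they are not taken from the paper, whose actual synopsis features are not listed in the abstract.

```python
import re

# Hypothetical layout cues; the paper's real feature set is not given in the abstract.
BULLET = re.compile(r"\s*([-*+]|\d+[.)])\s+")      # "- item", "* item", "1. item", "2) item"
HEADER_FIELD = re.compile(r"[A-Za-z][\w-]*:\s")    # "Subject: ...", "From: ..."


def structure_synopsis(text):
    """Return a small, fixed-length dict of layout features for a plain-text document."""
    lines = text.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    total = len(lines) or 1      # avoid division by zero on empty input
    n = len(non_blank) or 1
    return {
        "blank_ratio": (len(lines) - len(non_blank)) / total,
        "indented_ratio": sum(ln[0] in " \t" for ln in non_blank) / n,
        "bullet_ratio": sum(bool(BULLET.match(ln)) for ln in non_blank) / n,
        "header_field_ratio": sum(bool(HEADER_FIELD.match(ln)) for ln in non_blank) / n,
        "mean_line_length": sum(len(ln) for ln in non_blank) / n,
    }


sample = "Subject: Meeting\nFrom: alice@example.com\n\n- bring notes\n- book room\n"
features = structure_synopsis(sample)
print(features)
```

A vector of this kind stays the same length regardless of vocabulary size, so it could be passed to any decision tree learner in place of a word-frequency vector.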

Liping Ma, John Shepherd, Anh Nguyen

ADC 2003 | Database | Feature Vector | Succinct Feature Vector | Such Feature Vectors |

Added	23 Aug 2010
Updated	23 Aug 2010
Type	Conference
Year	2003
Where	ADC
Authors	Liping Ma, John Shepherd, Anh Nguyen
