Sciweavers

PAKM
2008

Classifying Digital Resources in a Practical and Coherent Way with Easy-to-Get Features

Digital resources come in a rich variety of forms and types, which makes them complex data objects. Their volume on the Web grows quickly, yet they are hard to classify efficiently. The paper presents a practical classification solution that uses features drawn from the file names and extensions of digital resources. These features are easy to obtain and common to all resources, but they are generally low-frequency and sparse, which implies that a statistical approach may not work well. Our solution combines a Naive Bayes (NB) classifier with Simple Good-Turing (SGT) probability estimation, which shows great promise under these conditions, reaching a total accuracy of 80%. In our opinion, the results arise because 1) the features fit NB's conditional independence hypothesis well; and 2) the abundant one-time-occurrence features lead to reasonable probability estimates for unobserved features, which also means that a general feature selection strategy is not needed in this case. A 7.4TB digital resource collection, CDAL, is used to train and evalu...
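As a rough illustration of the idea in the abstract, the sketch below classifies resources from file name and extension tokens with a Naive Bayes model. It is not the authors' implementation: the tokenizer, the class name `FileNameNB`, the toy training data, and the clamp on the unseen mass are all assumptions made for this example. It also uses only the basic Good-Turing estimate (the share of tokens seen exactly once is reserved for unseen tokens), not the full Simple Good-Turing procedure, which additionally smooths the frequency-of-frequency counts.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(filename):
    """Split a file name into lowercase alphanumeric tokens (extension included)."""
    return [t for t in re.split(r"[^a-z0-9]+", filename.lower()) if t]


class FileNameNB:
    """Naive Bayes over file-name tokens with a simplified Good-Turing unseen mass."""

    def fit(self, names, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for name, label in zip(names, labels):
            self.token_counts[label].update(tokenize(name))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def _log_prob(self, token, label):
        counts = self.token_counts[label]
        total = sum(counts.values())
        singletons = sum(1 for c in counts.values() if c == 1)
        # Basic Good-Turing: mass for unseen tokens is roughly N1 / N.
        # The clamp to [0.001, 0.5] is an assumption that keeps both the
        # seen and unseen probabilities strictly positive on tiny data sets.
        p0 = min(max(singletons / total, 1e-3), 0.5)
        if token in counts:
            return math.log((1.0 - p0) * counts[token] / total)
        # Spread the unseen mass over vocabulary types not seen in this class.
        unseen_types = max(len(self.vocab) - len(counts), 1)
        return math.log(p0 / unseen_types)

    def predict(self, filename):
        n = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / n)  # class prior
            for token in tokenize(filename):
                score += self._log_prob(token, label)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy data, invented for illustration only.
names = ["holiday_photo.jpg", "family_photo.jpg",
         "lecture_notes.pdf", "thesis_draft.pdf"]
labels = ["image", "image", "document", "document"]
model = FileNameNB().fit(names, labels)
```

Calling `model.predict("beach_photo.jpg")` scores each class by its prior plus the summed log-probabilities of the tokens `beach`, `photo`, and `jpg`. The token `beach` never appears in training, so it draws on the mass reserved by the singleton count rather than receiving zero probability, which is the role the abstract attributes to the one-time-occurrence features.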
Chong Chen, Hongfei Yan, Xiaoming Li
Added 30 Oct 2010
Updated 30 Oct 2010
Type Conference
Year 2008
Where PAKM