
ICDAR
2009
IEEE

Document Content Extraction Using Automatically Discovered Features

We report an automatic feature discovery method that achieves results comparable to a manually chosen, larger feature set on a document image content extraction problem: the location and segmentation of regions containing handwriting and machine-printed text in document images. As first detailed in [17], this approach is a greedy forward selection algorithm that iteratively constructs one linear feature at a time. The algorithm finds error clusters in the current feature space, then projects one tight cluster into the null space of the feature mapping, where a new feature that helps to classify these errors can be discovered. We conducted experiments on 87 diverse test images. Four manually chosen linear features with an error rate of 16.2% were given to the algorithm; the algorithm then found an additional ten features; the composite 14 features achieved an error rate of 13.8%. This outperforms a feature set of size 14 chosen by Principal Component Analysis (PCA) with an error rate ...
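The null-space step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`orthonormal_rows`, `project_to_null_space`, `propose_feature`), the toy data, and the use of a class-mean difference to pick the new direction are all assumptions made for illustration. The idea shown is only that samples the current linear features cannot tell apart are projected onto the subspace those features ignore, and a new linear feature is sought there.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_rows(rows, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the current feature vectors."""
    basis = []
    for r in rows:
        v = list(r)
        for q in basis:
            c = dot(v, q)
            v = [a - c * b for a, b in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > tol:
            basis.append([a / norm for a in v])
    return basis

def project_to_null_space(x, basis):
    """Remove from x every component the current linear features already measure."""
    v = list(x)
    for q in basis:
        c = dot(v, q)
        v = [a - c * b for a, b in zip(v, q)]
    return v

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def propose_feature(features, class_a, class_b):
    """Assumed stand-in for feature discovery: the unit vector along the
    difference of the two class means after null-space projection."""
    basis = orthonormal_rows(features)
    mean_a = mean([project_to_null_space(x, basis) for x in class_a])
    mean_b = mean([project_to_null_space(x, basis) for x in class_b])
    w = [a - b for a, b in zip(mean_a, mean_b)]
    norm = dot(w, w) ** 0.5
    if norm == 0.0:
        return None  # the cluster is not separable by this simple rule
    return [a / norm for a in w]

# Toy error cluster: one existing feature (the first axis) gives both
# classes the same value, so they collide in the current feature space.
features = [[1.0, 0.0, 0.0]]
class_a = [[0.5, 0.1, 1.0], [0.5, -0.1, 1.2]]
class_b = [[0.5, 0.1, -1.0], [0.5, -0.1, -1.2]]

new_feature = propose_feature(features, class_a, class_b)
scores_a = [dot(new_feature, x) for x in class_a]
scores_b = [dot(new_feature, x) for x in class_b]
```

In this toy case the proposed direction is orthogonal to the existing feature, and the sign of its score separates the two classes. A greedy forward selection loop would append `new_feature` to `features`, re-evaluate the classifier, and repeat on the next error cluster.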
Sui-Yu Wang, Henry S. Baird, Chang An
Added 18 Feb 2011
Updated 18 Feb 2011
Type Journal
Year 2009
Where ICDAR