Document Content Extraction Using Automatically Discovered Features

13 years 2 months ago


We report an automatic feature discovery method that achieves results comparable to those of a manually chosen, larger feature set on a document image content extraction problem: the location and segmentation of regions containing handwriting and machine-printed text in document images. As first detailed in [17], this approach is a greedy forward selection algorithm that iteratively constructs one linear feature at a time. The algorithm finds clusters of errors in the current feature space, then projects one tight cluster into the null space of the feature mapping, where a new feature that helps to classify these errors can be discovered. We conducted experiments on 87 diverse test images. Four manually chosen linear features, which gave an error rate of 16.2%, were supplied to the algorithm; the algorithm then found ten additional features, and the combined set of 14 features achieved an error rate of 13.8%. This outperforms a feature set of size 14 chosen by Principal Component Analysis (PCA) with an error rate ...
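The null-space step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `null_space_basis` and `discover_feature` are hypothetical, the error cluster is assumed to be already found and passed in, and the new direction is chosen by a simple difference of class means inside the null space, a stand-in for whatever criterion the paper actually uses.

```python
import numpy as np


def null_space_basis(W, tol=1e-10):
    """Orthonormal basis (as columns) of the null space of W, shape (k, d)."""
    # Rows of vt beyond the numerical rank of W span its null space.
    _, s, vt = np.linalg.svd(W, full_matrices=True)
    rank = int(np.sum(s > tol))
    return vt[rank:].T  # shape (d, d - rank)


def discover_feature(W, X_cluster, y_cluster):
    """Propose one new linear feature from a cluster of misclassified samples.

    W         : (k, d) current linear feature mapping
    X_cluster : (n, d) samples in one tight error cluster (assumed given)
    y_cluster : (n,)   binary class labels, 0 or 1

    ASSUMPTION: the direction is the difference of class means in the
    null space; the paper's actual selection rule may differ.
    """
    N = null_space_basis(W)
    Z = X_cluster @ N  # coordinates of the samples in the null space
    diff = Z[y_cluster == 1].mean(axis=0) - Z[y_cluster == 0].mean(axis=0)
    w_new = N @ diff  # map the direction back to the original space
    norm = np.linalg.norm(w_new)
    if norm == 0.0:
        raise ValueError("classes coincide in the null space")
    return w_new / norm


# Toy usage: the current feature reads only coordinate 0, where the
# two classes are identical; they differ along coordinate 2.
W = np.array([[1.0, 0.0, 0.0]])
X_cluster = np.array([[0.0, 0.0, 1.0],
                      [0.0, 0.0, 1.2],
                      [0.0, 0.0, -1.0],
                      [0.0, 0.0, -0.8]])
y_cluster = np.array([1, 1, 0, 0])

w_new = discover_feature(W, X_cluster, y_cluster)
W_extended = np.vstack([W, w_new])  # greedy step: append the new feature
```

Because the new direction lies in the null space of `W`, it is orthogonal to every existing feature, so it can only add information that the current mapping discards. Repeating the step with `W_extended` gives the one-feature-at-a-time forward selection that the abstract describes.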

Sui-Yu Wang, Henry S. Baird, Chang An
