Sciweavers

DOCENG
2004
ACM

Supervised learning for the legacy document conversion

We consider the problem of converting documents from rendering-oriented HTML markup into semantic-oriented XML annotation defined by user-specific DTDs or XML Schema descriptions. We represent both source and target documents as rooted ordered trees, so the conversion can be achieved by applying a set of tree transformations. We apply a supervised learning framework to the conversion task, in which the tree transformations are learned from a set of training examples. We develop a two-step approach to the conversion problem that first labels the leaves in the source trees and then recomposes the target trees from the leaf labels. We present two solutions, one based on classifying leaves with target terminals and one based on classifying them with target paths. Moreover, we develop three methods for the leaf classification. All methods and solutions have been tested on two real collections.
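The two-step idea in the abstract (label the source leaves, then recompose the target tree from the labels) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the sample document `SOURCE`, the target paths such as `paper/authors/author`, and the hand-written `classify` rules are all hypothetical stand-ins, and in the paper's setting the classifier would be learned from training examples rather than coded by hand. The sketch also assumes the source is well-formed XHTML and that every target path shares the same root and has at least two components.

```python
import xml.etree.ElementTree as ET

# Hypothetical rendering-oriented source document (well-formed XHTML assumed).
SOURCE = (
    "<html><body>"
    "<h1>Supervised learning</h1>"
    "<p><b>B. Chidlovskii</b></p>"
    "<p><b>J. Fuselier</b></p>"
    "<p>We consider conversion.</p>"
    "</body></html>"
)


def leaves(node, path=()):
    """Yield (source_path, text) for each text-bearing leaf, in document order."""
    path = path + (node.tag,)
    if node.text and node.text.strip():
        yield path, node.text.strip()
    for child in node:
        yield from leaves(child, path)
        # Text following a child element belongs to the current (parent) node.
        if child.tail and child.tail.strip():
            yield path, child.tail.strip()


def classify(source_path, text):
    """Step 1: assign a target path to a leaf.

    Hypothetical hand-written rules standing in for a learned classifier.
    """
    if "h1" in source_path:
        return "paper/title"
    if "b" in source_path:
        return "paper/authors/author"
    return "paper/abstract"


def recompose(labeled):
    """Step 2: rebuild the target tree from (target_path, text) pairs.

    Intermediate elements are reused when they already exist; the final
    element of each path is always created anew, so repeated labels
    (e.g. several authors) become sibling elements.
    """
    root = None
    for target_path, text in labeled:
        tags = target_path.split("/")
        if root is None:
            root = ET.Element(tags[0])
        node = root
        for tag in tags[1:-1]:
            found = node.find(tag)
            node = found if found is not None else ET.SubElement(node, tag)
        ET.SubElement(node, tags[-1]).text = text
    return root


def convert(source):
    """Run both steps and return the target XML as a string."""
    labeled = [(classify(p, t), t) for p, t in leaves(ET.fromstring(source))]
    return ET.tostring(recompose(labeled), encoding="unicode")
```

Calling `convert(SOURCE)` returns a semantic-oriented XML string in which the heading becomes a `title` element, the two bold names become `author` elements grouped under `authors`, and the remaining paragraph becomes `abstract`. Labeling leaves with full target paths, rather than only with terminal element names, is what lets the recomposition step place each leaf without further inference in this simplified setting.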
Boris Chidlovskii, Jérôme Fuselier
Added 30 Jun 2010
Updated 30 Jun 2010
Type Conference
Year 2004
Where DOCENG
Authors Boris Chidlovskii, Jérôme Fuselier