Learning deterministic regular expressions for the inference of schemas from XML data


Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirical...
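To make the k-ORE definition concrete, here is a minimal sketch that counts how often each alphabet symbol occurs in an expression and reports the smallest k for which the expression is a k-ORE. It is not the learning algorithm from the paper. It assumes the expression is written as a DTD-style content model string in which element names are identifiers and everything else is an operator or parenthesis; the function names `occurrence_bound` and `is_k_ore` are hypothetical, chosen only for this illustration.

```python
import re
from collections import Counter


def occurrence_bound(content_model: str) -> int:
    """Return the smallest k such that the expression is a k-ORE.

    Assumption: element names match an identifier-like pattern, and all
    other characters (",", "|", "*", "+", "?", parentheses) are operators.
    """
    symbols = re.findall(r"[A-Za-z_][\w.\-]*", content_model)
    counts = Counter(symbols)
    # The bound is the largest number of occurrences of any single symbol.
    return max(counts.values(), default=0)


def is_k_ore(content_model: str, k: int) -> bool:
    """Check whether every alphabet symbol occurs at most k times."""
    return occurrence_bound(content_model) <= k


if __name__ == "__main__":
    model = "(title, author+, (author | editor)*)"
    print(occurrence_bound(model))  # "author" occurs twice
    print(is_k_ore(model, 1))
```

In this example the symbol `author` occurs twice, so the expression is a 2-ORE but not a 1-ORE.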

Geert Jan Bex, Wouter Gelade, Frank Neven, Stijn Vansummeren


Deterministic Regular Expressions | Internet Technology | K-occurrence Regular Expressions | WWW 2008 | XML Schema Definition


» SchemaScope: a system for inferring and cleaning XML schemas

» XTRACT: A System for Extracting Document Type Descriptors from XML Documents

» Simplifying XML schema: effortless handling of nondeterministic regular expressions

» Complexity of Decision Problems for Simple Regular Expressions

» Lifting XML Schema to OWL

» Schema Evolution for XML: A Consistency-Preserving Approach

» Regular expression filters for XML

» StatiX: making XML count

Post Info

Added	21 Nov 2009
Updated	21 Nov 2009
Type	Conference
Year	2008
Where	WWW
Authors	Geert Jan Bex, Wouter Gelade, Frank Neven, Stijn Vansummeren
