Web pages contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts a user from actual content. Extraction of "use...
Fully automatic methods that extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies...
We present a novel approach to recognizing Textual nt. Structural features are constructed from abstract tree descriptions, which are automatically extracted from syntactic depend...
— The number of XML documents produced and available on the Internet is steadily increasing. It is thus important to devise automatic procedures to extract useful information fro...
Francesca Trentini, Markus Hagenbuchner, Alessandr...
While several hierarchical classification methods have been applied to web content, such techniques invariably rely on a pre-defined taxonomy of documents. We propose a new techni...