Query languages for XML, such as XPath or XQuery, support Boolean retrieval, where a query result is a (possibly restructured) subset of XML elements or entire documents tha...
We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to comput...
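As a rough illustration of the tag-ratio idea mentioned in the CETR abstract above, the sketch below computes a per-line ratio of visible text characters to HTML tags. The abstract is truncated before it defines the computation, so the exact formula here is an assumption: the ratio is taken as text length divided by tag count, with a tag count of zero treated as one. The names `tag_ratios` and `TAG_RE` are hypothetical, and the regular expression is a simplification that does not handle comments, scripts, or tags spanning several lines.

```python
import re

# Simplified tag matcher (assumption): anything between '<' and '>' on one line.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html: str) -> list[float]:
    """Return one text-to-tag ratio per line of the HTML source."""
    ratios = []
    for line in html.splitlines():
        n_tags = len(TAG_RE.findall(line))
        # Characters left after removing tags, ignoring surrounding whitespace.
        n_text = len(TAG_RE.sub("", line).strip())
        # Lines without tags use a divisor of 1, so the ratio is the text length.
        ratios.append(n_text / max(n_tags, 1))
    return ratios


page = (
    "<div><a href='/'>Home</a></div>\n"
    "<p>This is the main article text of the page.</p>\n"
    "plain text"
)
ratios = tag_ratios(page)
```

Under this assumption, navigation-like lines (many tags, little text) get low ratios, and content-like lines get high ratios; a separate thresholding or clustering step, not shown here, would then decide which lines to keep.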
In this paper, we describe a new approach to information extraction that neatly integrates top-down hypothesis-driven information with bottom-up data-driven information. ...
Constrained gradient analysis (similar to the “cubegrade” problem posed by Imielinski et al. [9]) aims to extract pairs of similar cell characteristics associated with big chan...
Guozhu Dong, Jiawei Han, Joyce M. W. Lam, Jian Pei...
We present a general framework for the task of extracting specific information “on demand” from a large corpus such as the Web under resource constraints. Given a database wit...