Sciweavers

ICISC
2001
162views Cryptology» more  ICISC 2001»
13 years 6 months ago
Content Extraction Signatures
Motivated by emerging needs in online interactions, we define a new type of digital signature called a `Content Extraction Signature' (CES). A CES allows the owner, Bob, of a...
Ron Steinfeld, Laurence Bull, Yuliang Zheng
IIWAS
2008
13 years 6 months ago
Combining content extraction heuristics: the CombinE system
The main text content of an HTML document on the WWW is typically surrounded by additional contents, such as navigation menus, advertisements, link lists or design elements. Conte...
Thomas Gottron
DRR
2008
13 years 6 months ago
Segmentation-based retrieval of document images from diverse collections
We describe a methodology for retrieving document images from large extremely diverse collections. First we perform content extraction, that is the location and measurement of reg...
Michael A. Moll, Henry S. Baird
LREC
2010
169views Education» more  LREC 2010»
13 years 6 months ago
An Evaluation of Technologies for Knowledge Base Population
Previous content extraction evaluations have neglected to address problems which complicate the incorporation of extracted information into an existing knowledge base. Previous qu...
Paul McNamee, Hoa Trang Dang, Heather Simpson, Pat...
AINA
2009
IEEE
13 years 11 months ago
Learning to Extract Content from News Webpages
We consider the problem of content extraction from online news webpages. To explore to what extent the syntactic markup and the visual structure of a webpage facilitate the extrac...
Alex Spengler, Patrick Gallinari
WWW
2010
ACM
13 years 11 months ago
CETR: content extraction via tag ratios
We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to comput...
Tim Weninger, William H. Hsu, Jiawei Han
WWW
2003
ACM
14 years 5 months ago
DOM-based content extraction of HTML documents
Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. Extraction o...
Suhit Gupta, Gail E. Kaiser, David Neistadt, Peter...
WWW
2005
ACM
14 years 5 months ago
Extracting context to improve accuracy for HTML content extraction
Web pages contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts a user from actual content. Extraction of "use...
Suhit Gupta, Gail E. Kaiser, Salvatore J. Stolfo