Sciweavers

» DOM-based content extraction of HTML documents
TREC
2008
13 years 7 months ago
IIT Kharagpur at TREC 2008 Blog Track
This paper describes our opinion retrieval system for the TREC 2008 Blog Track. We focused on five different aspects of the system. The first module focuses on extracting the blog...
Robin Anil, Sudeshna Sarkar
DOCENG
2009
ACM
14 years 1 days ago
Deriving image-text document surrogates to optimize cognition
The representation of information collections needs to be optimized for human cognition. While documents often include rich visual components, collections, including personal coll...
Eunyee Koh, Andruid Kerne
WWW
2006
ACM
14 years 6 months ago
Relaxed: on the way towards true validation of compound documents
To maintain interoperability in the Web environment it is necessary to comply with Web standards. Current specifications of HTML and XHTML languages define conformance conditions ...
Jirka Kosek, Petr Nálevka
WWW
2009
ACM
14 years 6 months ago
Extracting article text from the web with maximum subsequence segmentation
Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content...
Jeff Pasternack, Dan Roth
ITCC
2005
IEEE
13 years 11 months ago
Elimination of Redundant Information for Web Data Mining
These days, billions of Web pages are created with HTML or other markup languages. They only have a few uniform structures and contain various authoring styles compared to traditi...
Shakirah Mohd Taib, Soon-ja Yeom, Byeong Ho Kang