Sciweavers

» Crawling the web for structured documents
WSDM
2010
ACM
15 years 8 months ago
Learning URL patterns for webpage de-duplication
Presence of duplicate documents in the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we pres...
Hema Swetha Koppula, Krishna P. Leela, Amit Agarwa...
WWW
2010
ACM
15 years 8 months ago
New-web search with microblog annotations
Web search engines discover indexable documents by recursively ‘crawling’ from a seed URL. Their rankings take into account link popularity. While this works well, it introduc...
Tom Rowlands, David Hawking, Ramesh Sankaranarayan...
WWW
2005
ACM
15 years 7 months ago
Finding the boundaries of information resources on the web
In recent years, many algorithms for the Web have been developed that work with information units distinct from individual web pages. These include segments of web pages or aggreg...
Pavel Dmitriev, Carl Lagoze, Boris Suchkov
ICDAR
2003
IEEE
15 years 7 months ago
Web Page Summarization for Handheld Devices: A Natural Language Approach
Summarization of web pages is a very interesting topic from both academic and commercial points of view. Academically, it is challenging to create a summary of a document (e.g. a w...
Hassan Alam, Rachmat Hartono, Aman Kumar, Ahmad Fu...
ACMICEC
2006
ACM
15 years 7 months ago
From HTML documents to web tables and rules
We present a browser-extending Semantic Web extraction system that maps HTML documents to tables and, where possible, to rules. First, the basic data extractor ViPER distills and ...
Kai Simon, Georg Lausen, Harold Boley