The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we pres...
Hema Swetha Koppula, Krishna P. Leela, Amit Agarwa...
Web search engines discover indexable documents by recursively ‘crawling’ from a seed URL. Their rankings take into account link popularity. While this works well, it introduc...
Tom Rowlands, David Hawking, Ramesh Sankaranarayan...
In recent years, many algorithms for the Web have been developed that work with information units distinct from individual web pages. These include segments of web pages or aggreg...
Summarization of web pages is a very interesting topic from both academic and commercial points of view. Academically, it is challenging to create a summary of a document (e.g. a w...
Hassan Alam, Rachmat Hartono, Aman Kumar, Ahmad Fu...
We present a browser-extending Semantic Web extraction system that maps HTML documents to tables and, where possible, to rules. First, the basic data extractor ViPER distills and ...