The correct web site text content must be help to the visitors to find what they are looking for. However, the reality is quite different, many times the web page text content is a...
Web pages contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts a user from actual content. Extraction of "use...
Web object is defined to represent any meaningful object embedded in web pages (e.g. images, music) or pointed to by hyperlinks (e.g. downloadable files). Users usually search for...
Detection of template and noise blocks in web pages is an important step in improving the performance of information retrieval and content extraction. Of the many approaches propos...
Organizations like the Internet Archive have been capturing Web contents over decades, building up huge repositories of time-versioned pages. The timestamp annotations and the she...
Gerhard Weikum, Nikos Ntarmos, Marc Spaniol, Peter...