Web pages contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts a user from the actual content. Extraction of "use...
Web pages are usually highly structured documents. In some documents, content with different functionality is laid out in blocks, some merely supporting the main discourse. In ot...
As web sites become more complicated, the construction of web information extraction systems becomes more troublesome and time-consuming. A common theme is the diffi...
This paper presents a web page fragmentation technique that is used to extract specific parts of web pages and to build personalized portals from these fragments. It is ...
Ioannis Misedakis, Vaggelis Kapoulas, Christos Bou...
We describe Thresher, a system that lets non-technical users teach their browsers how to extract semantic web content from HTML documents on the World Wide Web. Users specify exam...