Sciweavers

WWW
2005
ACM

Thresher: automating the unwrapping of semantic content from the World Wide Web

14 years 5 months ago
Thresher: automating the unwrapping of semantic content from the World Wide Web
We describe Thresher, a system that lets non-technical users teach their browsers how to extract semantic web content from HTML documents on the World Wide Web. Users specify examples of semantic content by highlighting them in a web browser and describing their meaning. We then use the tree edit distance between the DOM subtrees of these examples to create a general pattern, or wrapper, for the content, and allow the user to bind RDF classes and predicates to the nodes of these wrappers. By overlaying matches to these patterns on standard documents inside the Haystack semantic web browser, we enable a rich semantic interaction with existing web pages, "unwrapping" semantic data buried in the pages' HTML. By allowing end-users to create, modify, and utilize their own patterns, we hope to speed adoption and use of the Semantic Web and its applications. Categories and Subject Descriptors H.3.5 [Information Storage and Retrieval]: Online Information Services; H.5.2 [Inform...
Andrew Hogue, David R. Karger
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2005
Where WWW
Authors Andrew Hogue, David R. Karger
Comments (0)