We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious li...
Junfeng Wang, Xiaofei He, Can Wang, Jian Pei, Jiaj...
The two most important tasks in entity information summarization from the Web are named entity recognition and relation extraction. Little work has been done toward an integrated ...
This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cleaning arbitrary web pages with the goal of extracting a corpus from web data, ...
A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the e...
XML is rapidly emerging as the new standard for data representation and exchange on the Web. An XML document can be accompanied by a Document Type Descriptor (DTD) which plays the...
Minos N. Garofalakis, Aristides Gionis, Rajeev Ras...