The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from...
Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wa...
The next wave in search technology will be driven by the identification, extraction, and exploitation of real-world entities represented in unstructured textual sources. Search sy...
Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (...
In prior work we have demonstrated that search engine caches and archiving projects like the Internet Archive’s Wayback Machine can be used to “lazily preserve” websites and...
Automatic keyword annotation is a promising solution to enable more effective image search by using keywords. In this paper, we propose a novel automatic image annotation method b...