Abstract. For the 2008 participation at GeoCLEF, we focused on improving the extraction of geographic signatures from documents and optimising their use for GIR. The results show t...
Challenging the implicit reliance on document collections, this paper discusses the pros and cons of using query logs, rather than document collections, as self-contained sources o...
XML is becoming a prevalent format for data exchange. Many XML documents have complex schemas that are not always known, and can vary widely between information sources and applica...
Eugene Agichtein, C. T. Howard Ho, Vanja Josifovsk...
We propose a novel approach that identifies web page templates and extracts the unstructured data. Extracting only the body of the page and eliminating the template increases the ...
Two-dimensional (2-D) plots in digital documents contain important information. Often, the results of scientific experiments and performance of businesses are summarized using pl...
Xiaonan Lu, James Ze Wang, Prasenjit Mitra, C. Lee...