The Web is a valuable source of language speci c resources but the process of collecting, organizing and utilizing these resources is di cult. We describe CorpusBuilder, an approa...
Proxy caches have become a central mechanism for reducing the latency of web document retrieval. While caching alone reduces latency for previously requested documents, web docume...
XML documents represent a middle range between unstructured data such as textual documents and fully structured data encoded in databases. Typically, information retrieval techniq...
Yosi Mass, Dafna Sheinwald, Benjamin Sznajder, Siv...
The presentation of search results on the web has been dominated by the textual form of document representation. On the other hand, the document's visual aspects such as the ...
Document classification is a key task for many text mining applications. However, traditional text classification requires labeled data to construct reliable and accurate classifie...