The Internet makes it possible to share information (e.g., text, images, audio, video, and other data formats) across the globe. In this paper we look at collaborative Internet en...
This paper proposes a new approach to the challenging open-set language detection task. Most state-of-the-art approaches make use of data sources with several out-of-set languages...
Mohamed Faouzi BenZeghiba, Jean-Luc Gauvain, Lori ...
We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to comput...
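As a rough illustration of the tag-ratio idea named in the CETR snippet above, the sketch below computes one ratio per line of an HTML document. It assumes that a line's tag ratio is the number of non-tag characters divided by the number of tags on that line, falling back to the character count when the line has no tags. That definition, the regular expression, and the function name `tag_ratios` are assumptions made for this example and are not taken from the truncated abstract.

```python
import re

# Simplified tag matcher (assumption): anything between '<' and '>' is a tag.
# Real-world HTML (comments, scripts, multi-line tags) needs a proper parser.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html: str) -> list[float]:
    """Return one text-to-tag ratio per line of the given HTML string."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        # Characters left on the line once the tags are removed.
        text_chars = len(TAG_RE.sub("", line).strip())
        # Assumed convention: a line without tags keeps its raw text length.
        ratios.append(text_chars / tag_count if tag_count else float(text_chars))
    return ratios


sample = (
    "<div><a href='#'>Home</a></div>\n"
    "<p>This is a long paragraph of article text.</p>\n"
    "plain text"
)
ratios = tag_ratios(sample)
```

Under this assumed definition, the navigation line gets a low ratio and the paragraph line a higher one, which is the kind of signal a content extractor could threshold or cluster on.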
We describe our participation in the TREC 2007 Enterprise track and detail our language modeling-based approaches. For document search, our focus was on estimating a mixture mode...
Master data refers to core business entities a company uses repeatedly across many business processes and systems (such as lists or hierarchies of customers, suppliers, accounts, ...