The paper presents YAWN, a system to convert the well-known and widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce alg...
Ralf Schenkel, Fabian M. Suchanek, Gjergji Kasneci
Abstract. This paper describes a methodology for constructing aligned German-Chinese corpora from movie subtitles. The corpora will be used to train a special machine translation s...
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work i...
This paper presents the transcription and evaluation of the DIMEx100 corpus for Mexican Spanish. First we describe the corpus and explain the linguistic and computational mo...
Luis Alberto Pineda, Hayde Castellanos, Javier Cu&...
A key step in validating a proposed idea or system is to evaluate it on a suitable data set. However, to date there have been no useful tools for researchers to understand ...
Meiyu Lu, Srinivas Bangalore, Graham Cormode, Mari...