Mining the Web to Create Minority Language Corpora

The Web is a valuable source of language-specific resources, but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query lengths to find inclusion/exclusion terms that are helpful for retrieving documents in the target language, and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Experiments applying the same approach to multiple languages...
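The odds-ratio query generation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formula, so the smoothed log odds ratio over document frequencies, the whitespace tokenization, the function names, and the `+term`/`-term` query syntax are all assumptions made here for illustration.

```python
import math
from collections import Counter


def odds_ratio_scores(relevant_docs, irrelevant_docs, smoothing=0.5):
    """Score each term by a smoothed log odds ratio of document frequencies.

    Assumption: documents are plain strings split on whitespace; the
    additive smoothing constant is a hypothetical choice, not from the paper.
    """
    rel_df = Counter(t for d in relevant_docs for t in set(d.lower().split()))
    irr_df = Counter(t for d in irrelevant_docs for t in set(d.lower().split()))
    n_rel, n_irr = len(relevant_docs), len(irrelevant_docs)

    scores = {}
    for term in set(rel_df) | set(irr_df):
        # Smoothed probability that a document of each class contains the term.
        p_rel = (rel_df[term] + smoothing) / (n_rel + 2 * smoothing)
        p_irr = (irr_df[term] + smoothing) / (n_irr + 2 * smoothing)
        scores[term] = math.log((p_rel * (1 - p_irr)) / ((1 - p_rel) * p_irr))
    return scores


def build_query(scores, k=2):
    """Pick up to k inclusion and k exclusion terms and format a query.

    Assumption: the search engine accepts '+term' and '-term' operators.
    """
    # Highest-scoring terms favor the target language; ties break alphabetically.
    include = [t for t in sorted(scores, key=lambda t: (-scores[t], t))
               if scores[t] > 0][:k]
    # Lowest-scoring terms favor the non-target documents.
    exclude = [t for t in sorted(scores, key=lambda t: (scores[t], t))
               if scores[t] < 0][:k]
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])


if __name__ == "__main__":
    # Toy documents, already labeled by some language classifier (hypothetical).
    relevant = ["in je da", "je na da"]
    irrelevant = ["the of and", "the in of"]
    print(build_query(odds_ratio_scores(relevant, irrelevant), k=2))
```

In a loop of the kind the abstract describes, the documents returned for such a query would be labeled by the language classifier and added to the two pools, and the scores would then be recomputed over everything acquired so far.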
Rayid Ghani, Rosie Jones, Dunja Mladenic
Added 28 Jul 2010
Updated 28 Jul 2010
Type Conference
Year 2001
Where CIKM