Randomised Language Modelling for Statistical Machine Translation

10 years 7 months ago
Randomised Language Modelling for Statistical Machine Translation
A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements are significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we explore the use of BFs for language modelling in statistical machine translation. We show how a BF containing n-grams can enable us to use much larger corpora and higher-order models complementing a conventional n-gram LM within an SMT system. We also consider (i) how to include approximate frequency information efficiently within a BF and (ii) how to reduce the error rate of these models by first checking for lower-order sub-sequences in candidate ngrams. Our solutions in both cases retain the one-sided error guarantees of the BF while taking advantage of the Zipf-like distribution of word frequencies to reduce the space requirements.
David Talbot, Miles Osborne
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2007
Where ACL
Authors David Talbot, Miles Osborne
Comments (0)