Smoothing a Tera-word Language Model

Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omitting low-frequency n-gram counts is a practical necessity for datasets of this size, but naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm that combines the Dirichlet prior form of MacKay and Peto (1995) with the modified back-off estimates of Kneser and Ney (1995), yielding a 31% perplexity reduction on the Brown corpus compared to a baseline implementation of Kneser-Ney discounting.
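The abstract combines two ingredients: Dirichlet-prior smoothing, which shrinks observed counts toward a prior distribution, and Kneser-Ney continuation counts, which estimate that back-off distribution from the number of distinct contexts a word follows. The bigram sketch below illustrates how the two fit together; it is not the paper's tera-word algorithm, and the function name, alpha value, and end-of-sequence handling are my own assumptions.

from collections import Counter, defaultdict

def dirichlet_kn_bigram(tokens, alpha=100.0):
    """Bigram probabilities: Dirichlet-prior smoothing (MacKay & Peto, 1995)
    whose prior mean is a Kneser-Ney-style continuation distribution
    (Kneser & Ney, 1995). alpha is an illustrative prior strength,
    not a value from the paper."""
    unigram = Counter(tokens)  # context counts (final token ignored for brevity)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Continuation counts: how many distinct left contexts each word follows.
    left = defaultdict(set)
    for u, w in bigrams:
        left[w].add(u)
    total = sum(len(s) for s in left.values())

    def p_cont(w):
        # Kneser-Ney continuation probability: fraction of bigram types ending in w.
        return len(left[w]) / total if total else 0.0

    def prob(w, u):
        # Observed count shrunk toward the continuation prior with weight alpha.
        return (bigrams[(u, w)] + alpha * p_cont(w)) / (unigram[u] + alpha)

    return prob

# Example usage:
#   p = dirichlet_kn_bigram("the cat sat on the mat".split())
#   p("cat", "the")   # probability of "cat" following "the"

The Dirichlet-prior form makes the smoothing strength explicit: as the context count unigram[u] grows, the estimate approaches the raw relative frequency; for sparse contexts it falls back toward the continuation distribution.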
Type: Conference
Year: 2008
Where: ACL
Authors: Deniz Yuret