Sciweavers

INFORMATICALT
2006

Cache-based Statistical Language Models of English and Highly Inflected Lithuanian

This paper investigates a variety of statistical cache-based language models built upon three corpora: English, Lithuanian, and Lithuanian base forms. It studies how the perplexity of a language model is affected by the cache size, the type of decay function (including custom corpus-derived functions), and the interpolation technique (static vs. dynamic). The best results are achieved by models consisting of three components: a standard 3-gram, a decaying cache 1-gram, and a decaying cache 2-gram, joined by linear interpolation with dynamically updated weights. Such a model improved perplexity by up to 36% and 43% relative to the 3-gram baseline for Lithuanian words and Lithuanian word base forms, respectively. The best language model of English improved perplexity by up to 16%. This suggests that cache-based modeling is of greater utility for highly inflected languages with free word order.
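The interpolation scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. For brevity it combines only two components (a baseline model and a decaying cache 1-gram) rather than the three used in the paper. The exponential decay function, the EM-style weight update, the parameter values, and all names (`DecayingCacheUnigram`, `update_weights`, `perplexity`) are assumptions chosen for illustration, and a uniform distribution stands in for the 3-gram baseline.

```python
import math
from collections import deque


class DecayingCacheUnigram:
    """Cache 1-gram in which recent words weigh more (assumed exponential decay)."""

    def __init__(self, size=200, decay=0.01):
        self.history = deque(maxlen=size)  # keeps only the last `size` words
        self.decay = decay

    def prob(self, word):
        # A word seen d positions back contributes weight exp(-decay * d).
        total = 0.0
        match = 0.0
        for d, w in enumerate(reversed(self.history)):
            weight = math.exp(-self.decay * d)
            total += weight
            if w == word:
                match += weight
        return match / total if total > 0 else 0.0

    def add(self, word):
        self.history.append(word)


def interpolate(probs, weights):
    """Linear interpolation: sum of weight * component probability."""
    return sum(l * p for l, p in zip(weights, probs))


def update_weights(probs, weights, rate=0.05):
    """Assumed dynamic update: nudge each weight toward its component's posterior."""
    mix = interpolate(probs, weights)
    if mix <= 0:
        return list(weights)
    posteriors = [l * p / mix for l, p in zip(weights, probs)]
    return [(1 - rate) * l + rate * q for l, q in zip(weights, posteriors)]


def perplexity(words, base_prob, cache, weights=(0.8, 0.2), rate=0.05):
    """Perplexity of the interpolated model, updating weights after each word."""
    weights = list(weights)
    log_sum = 0.0
    for i, w in enumerate(words):
        probs = [base_prob(words[:i], w), cache.prob(w)]
        log_sum += math.log(interpolate(probs, weights))
        weights = update_weights(probs, weights, rate)
        cache.add(w)  # the cache sees the word only after it is predicted
    return math.exp(-log_sum / len(words))


if __name__ == "__main__":
    text = ("a b " * 10).split()
    uniform = lambda history, word: 1.0 / 10  # toy stand-in for a 3-gram model
    print(perplexity(text, uniform, DecayingCacheUnigram(), weights=(1.0, 0.0)))
    print(perplexity(text, uniform, DecayingCacheUnigram(), weights=(0.8, 0.2)))
```

On this deliberately repetitive toy text, the interpolated model should reach a lower perplexity than the uniform baseline alone, because the cache assigns high probability to recently seen words and the weight update shifts mass toward the cache component as it proves useful.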
Airenas Vaiciunas, Gailius Raskinis
Added 13 Dec 2010
Updated 13 Dec 2010
Type Journal
Year 2006
Where INFORMATICALT
Authors Airenas Vaiciunas, Gailius Raskinis