Cache-based Statistical Language Models of English and Highly Inflected Lithuanian

This paper investigates a variety of statistical cache-based language models built upon three corpora: English, Lithuanian, and Lithuanian base forms. The impact of the cache size, the type of decay function (including custom corpus-derived functions), and the interpolation technique (static vs. dynamic) on the perplexity of a language model is studied. The best results are achieved by models consisting of three components: a standard 3-gram, a decaying cache 1-gram, and a decaying cache 2-gram, joined by linear interpolation with dynamic weight updates. Such a model led to perplexity improvements of up to 36% and 43% over the 3-gram baseline for Lithuanian words and Lithuanian word base forms, respectively. The best language model of English led to a perplexity improvement of up to 16%. This suggests that cache-based modeling is of greater utility for free-word-order, highly inflected languages.
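To make the three-component architecture concrete, here is a minimal Python sketch of a decaying 1-gram cache and of linear interpolation with dynamically updated weights. The abstract does not give the decay functions or the weight-update rule, so the exponential decay, the online EM-style update, and every name and parameter below (`DecayingUnigramCache`, `rate`, `step`, `update_weights`) are illustrative assumptions rather than the authors' method. The 3-gram and cache 2-gram probabilities are assumed to come from elsewhere and are passed in as plain numbers.

```python
import math
from collections import deque


class DecayingUnigramCache:
    """Cache over the last `size` words (hypothetical design).

    A word seen d positions ago contributes exp(-rate * d), so recent
    words count more. Exponential decay is an assumption made here.
    """

    def __init__(self, size=100, rate=0.01):
        self.history = deque(maxlen=size)  # oldest words drop out automatically
        self.rate = rate

    def add(self, word):
        self.history.append(word)

    def prob(self, word):
        total = 0.0
        match = 0.0
        # The most recent word is at distance 1.
        for d, w in enumerate(reversed(self.history), start=1):
            weight = math.exp(-self.rate * d)
            total += weight
            if w == word:
                match += weight
        return match / total if total > 0 else 0.0


def interpolate(weights, probs):
    """Linear interpolation: sum of weight_i * p_i over the components."""
    return sum(l * p for l, p in zip(weights, probs))


def update_weights(weights, probs, step=0.1):
    """One online EM-style step (an assumed rule, not the paper's).

    Each weight moves toward the share of the mixture probability that
    its component contributed for the word just observed.
    """
    mix = interpolate(weights, probs)
    if mix == 0.0:
        return list(weights)
    posterior = [l * p / mix for l, p in zip(weights, probs)]
    return [(1 - step) * l + step * r for l, r in zip(weights, posterior)]
```

In use, one would compute the three component probabilities for each incoming word, interpolate them, call `update_weights` with the same probabilities, and then add the word to the cache. Because the posterior shares sum to one, the weights stay normalized after every update.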

Airenas Vaiciunas, Gailius Raskinis

Base Forms | INFORMATICALT 2006 | Language Model | Perplexity Improvement |

Added	13 Dec 2010
Updated	13 Dec 2010
Type	Journal
Year	2006
Where	INFORMATICALT
Authors	Airenas Vaiciunas, Gailius Raskinis
