Concentration Bounds for Unigrams Language Model

Abstract. We show several PAC-style concentration bounds for learning the unigram language model. One interesting quantity is the total probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a PAC bound of approximately $O\!\left(\frac{k}{\sqrt{m}}\right)$. We improve its dependency on k to $O\!\left(\frac{\sqrt[4]{k}}{\sqrt{m}} + \frac{k}{m}\right)$. We also analyze the empirical-frequencies estimator, showing that its PAC error bound is approximately $O\!\left(\frac{1}{k} + \frac{\sqrt{k}}{m}\right)$. We derive a combined estimator whose error is approximately $O\!\left(m^{-2/5}\right)$, for any k. A standard measure of the quality of a learning algorithm is its expected per-word log-loss. We show that the leave-one-out method can be used to estimate the log-loss of the unigram model with a PAC error of approximately $O\!\left(\frac{1}{\sqrt{m}}\right)$, for any distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.
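To make the two estimators concrete, here is a minimal Python sketch (not from the paper): the Good-Turing estimate $G_k = \frac{(k+1)\,n_{k+1}}{m}$ and the empirical-frequencies estimate $\frac{k\,n_k}{m}$ of the total probability of words appearing exactly k times, where $n_j$ is the number of distinct words with count j. The Zipf-like toy sample and all identifiers are illustrative assumptions.

```python
import random
from collections import Counter

def good_turing_estimate(sample, k):
    """Good-Turing estimate of the total probability mass of words that
    appear exactly k times: G_k = (k + 1) * n_{k+1} / m."""
    m = len(sample)
    counts = Counter(sample)
    n_k_plus_1 = sum(1 for c in counts.values() if c == k + 1)
    return (k + 1) * n_k_plus_1 / m

def empirical_estimate(sample, k):
    """Empirical-frequencies estimate of the same quantity: the sum of
    empirical probabilities c / m over words with count exactly k."""
    m = len(sample)
    counts = Counter(sample)
    return sum(c / m for c in counts.values() if c == k)  # = k * n_k / m

# Toy usage: draw a sample from a Zipf-like distribution and compare
# the two estimates for words seen exactly once (k = 1).
random.seed(0)
vocab = [f"w{i}" for i in range(1000)]
weights = [1.0 / (i + 1) for i in range(1000)]
sample = random.choices(vocab, weights=weights, k=5000)
print(good_turing_estimate(sample, 1), empirical_estimate(sample, 1))
```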
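The leave-one-out idea can likewise be sketched in a few lines: score each occurrence $x_i$ under a unigram model fit to the other $m - 1$ words and average the negative log-probabilities. The add-one smoothing below is an illustrative assumption, not necessarily the paper's estimator; it keeps singleton words from receiving probability zero.

```python
import math
import random
from collections import Counter

def leave_one_out_log_loss(sample, vocab_size):
    """Leave-one-out estimate of the expected per-word log-loss of a
    unigram model: each occurrence is scored by a model trained on the
    remaining m - 1 words, with add-one smoothing (an assumption)."""
    m = len(sample)
    counts = Counter(sample)
    total = 0.0
    for x in sample:
        held_out_count = counts[x] - 1  # this occurrence is held out
        p = (held_out_count + 1) / (m - 1 + vocab_size)
        total += -math.log(p)
    return total / m

# Toy usage on a Zipf-like sample.
random.seed(0)
vocab = [f"w{i}" for i in range(100)]
weights = [1.0 / (i + 1) for i in range(100)]
sample = random.choices(vocab, weights=weights, k=2000)
print(leave_one_out_log_loss(sample, vocab_size=len(vocab)))
```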
Added 01 Jul 2010
Updated 01 Jul 2010
Type Conference
Year 2004
Where COLT
Authors Evgeny Drukh, Yishay Mansour