Sciweavers

ACL
2009

Knowing the Unseen: Estimating Vocabulary Size over Unseen Samples

Empirical studies on corpora involve measuring several quantities in order to compare corpora, build language models, or make generalizations about specific linguistic phenomena in a language. Quantities such as average word length are stable across sample sizes and hence can be reliably estimated from large enough samples. However, quantities such as vocabulary size change with sample size. Thus, measurements based on a given sample need to be extrapolated to obtain estimates over larger, unseen samples. In this work, we propose a novel nonparametric estimator of vocabulary size. Our main result is to show the statistical consistency of the estimator.
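The contrast the abstract draws between stable and sample-size-dependent quantities can be illustrated with a small sketch. This is not the estimator proposed in the paper; it only measures vocabulary size and average word length on growing prefixes of a synthetic corpus. The Zipf-like sampling, the word types `w1`, `w2`, ..., and the function names are all assumptions made for illustration.

```python
import random


def make_corpus(n_tokens, n_types=5000, seed=0):
    """Draw tokens from a Zipf-like distribution over hypothetical word types."""
    rng = random.Random(seed)
    types = [f"w{i}" for i in range(1, n_types + 1)]
    # Weight of the type at rank r is proportional to 1/r (assumed, not from the paper).
    weights = [1.0 / rank for rank in range(1, n_types + 1)]
    return rng.choices(types, weights=weights, k=n_tokens)


def vocabulary_size(tokens):
    """Number of distinct word types observed in the sample."""
    return len(set(tokens))


def average_word_length(tokens):
    """Mean token length in characters."""
    return sum(len(t) for t in tokens) / len(tokens)


corpus = make_corpus(20000)
sample_sizes = [1000, 5000, 20000]

# Measure both quantities on nested prefixes of the same corpus.
vocab_sizes = [vocabulary_size(corpus[:n]) for n in sample_sizes]
avg_lengths = [average_word_length(corpus[:n]) for n in sample_sizes]

for n, v, a in zip(sample_sizes, vocab_sizes, avg_lengths):
    print(f"sample size {n:6d}: vocabulary {v:5d}, average word length {a:.2f}")
```

On such synthetic data the vocabulary count keeps growing as the sample grows, while the average word length changes little, which is why a vocabulary size measured on one sample cannot simply be reused for a larger one.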
Suma Bhat, Richard Sproat
Added 16 Feb 2011
Updated 16 Feb 2011
Type Journal
Year 2009
Where ACL
Authors Suma Bhat, Richard Sproat