
LREC
2010

How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method

We investigate the impact of input data scale on corpus-based learning through a study in the style of Zipf's law. Chinese word segmentation is chosen as the case study, and a series of experiments is conducted in which two types of segmentation techniques, statistical learning and rule-based methods, are examined. The empirical results show that a linear performance improvement in statistical learning requires at least an exponential increase in training corpus size. For the rule-based method, an approximately negative inverse relationship between performance and the size of the input lexicon is observed.
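One way to read the statistical-learning result is as a log-linear curve, score ≈ a + b · ln(corpus size): each fixed gain in score then multiplies the required corpus by a constant factor. The sketch below illustrates that reading only. The function names, the coefficients, and the data points are hypothetical and synthetic; they are not taken from the paper, and the paper's own fitting procedure may differ.

```python
import math

def fit_log_linear(sizes, scores):
    """Ordinary least-squares fit of score = a + b * ln(size)."""
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def size_needed(a, b, target):
    """Corpus size at which the fitted curve reaches the target score."""
    return math.exp((target - a) / b)

# Synthetic points generated from an assumed curve (NOT data from the paper).
sizes = [10_000, 20_000, 40_000, 80_000, 160_000]
scores = [0.5 + 0.03 * math.log(n) for n in sizes]

a, b = fit_log_linear(sizes, scores)

# Under this model, every additional 0.01 of score multiplies the
# required corpus size by the same factor, exp(0.01 / b).
growth_per_point = size_needed(a, b, 0.86) / size_needed(a, b, 0.85)
```

Under these assumed coefficients, `growth_per_point` equals `exp(0.01 / 0.03)`, so a steady linear climb in score corresponds to geometric growth in corpus size.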
Hai Zhao, Yan Song, Chunyu Kit
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2010
Where LREC
Authors Hai Zhao, Yan Song, Chunyu Kit