ADtrees for sequential data and n-gram Counting

Abstract: We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such statistics, when they are derived from tabular data sets with many attributes, is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpus-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naïve approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
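For readers unfamiliar with the baselines named in the abstract, the following is a minimal Python sketch of n-gram counting with a prefix tree, next to a flat table keyed by the full n-gram. It is not the authors' ADtree adaptation, which the abstract does not specify. The names `TrieNode`, `build_trie`, `trie_count`, and `naive_counts` are hypothetical, and treating a flat table as the "naïve approach" is an assumption made only for illustration.

```python
from collections import Counter


class TrieNode:
    """One prefix-tree node; count = occurrences of the n-gram ending here."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}


def build_trie(tokens, max_n):
    """Insert every n-gram of length 1..max_n into a prefix tree."""
    root = TrieNode()
    for start in range(len(tokens)):
        node = root
        # Walk at most max_n tokens from this start position; each step
        # extends the current n-gram by one token and bumps its count.
        for tok in tokens[start:start + max_n]:
            node = node.children.setdefault(tok, TrieNode())
            node.count += 1
    return root


def trie_count(root, ngram):
    """Return the stored count for ngram, or 0 if it never occurred."""
    node = root
    for tok in ngram:
        node = node.children.get(tok)
        if node is None:
            return 0
    return node.count


def naive_counts(tokens, max_n):
    """Assumed flat baseline: one table entry per distinct n-gram tuple."""
    table = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            table[tuple(tokens[i:i + n])] += 1
    return table


# Toy corpus (illustrative only, not the Wall Street Journal data).
tokens = "the cat sat on the mat the cat ran".split()
root = build_trie(tokens, max_n=3)
table = naive_counts(tokens, max_n=3)
```

Both structures return the same counts in this sketch. The prefix tree stores shared prefixes such as "the cat" once, whereas the flat table repeats the full token tuple in every key.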
Robert Van Dam, Dan Ventura
Added 04 Jun 2010
Updated 04 Jun 2010
Type Conference
Year 2007
Where SMC
Authors Robert Van Dam, Dan Ventura