
IDEAS
2008
IEEE

Improved count suffix trees for natural language data

As more and more natural language text is stored in databases, handling query predicates over such text becomes very important. Optimizing queries with such predicates involves (sub)string estimation, i.e., estimating the selectivity of query terms from small summary statistics before query execution. Count Suffix Trees (CST) are commonly used to this end. While CST yield good estimates, they are expensive to build and require a large amount of memory. To fit in the data dictionary of a database system, they have to be severely pruned. Existing pruning techniques are based on suffix frequency or tree depth. In this paper, we propose new filtering and pruning techniques that reduce both the size of CST over natural-language texts and the cost of building them. The core idea is to exploit features of natural language data, i.e., to consider only the suffixes that are useful in a linguistic sense. The most important innovations are (a) a new aggressive approximate syllabification...
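
To make the baseline concrete, the following is a minimal sketch of a count suffix structure with the two existing pruning criteria named in the abstract, tree depth and suffix frequency. It is not the method proposed in the paper: the linguistic filtering is not shown, and a plain character trie stands in for a compressed suffix tree. All names (`Node`, `build_count_suffix_trie`, `prune_by_frequency`, `occurrence_count`) and the parameters `max_depth` and `min_count` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Number of occurrences of the substring spelled by the path to this node.
    count: int = 0
    children: dict = field(default_factory=dict)


def build_count_suffix_trie(strings, max_depth=4):
    """Insert every suffix of every string, truncated to max_depth characters.

    Truncation is the depth-based pruning criterion (assumed parameter).
    """
    root = Node()
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:start + max_depth]:
                node = node.children.setdefault(ch, Node())
                node.count += 1
    return root


def prune_by_frequency(node, min_count):
    """Drop every subtree whose root count is below min_count."""
    node.children = {
        ch: child for ch, child in node.children.items()
        if child.count >= min_count
    }
    for child in node.children.values():
        prune_by_frequency(child, min_count)


def occurrence_count(root, pattern):
    """Return the stored count for pattern, or None if it was not retained."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.count
```

A query optimizer would divide such a count by the total amount of indexed text to obtain a selectivity estimate. When the lookup returns `None`, the pattern was pruned away or exceeds the depth limit, and an estimator has to fall back on some heuristic, for example combining the counts of shorter retained substrings.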
Guido Sautter, Cristina Abba, Klemens Böhm
Added 31 May 2010
Updated 31 May 2010
Type Conference
Year 2008
Where IDEAS