Sciweavers

ACL
2012

Tokenization: Returning to a Long Solved Problem - A Survey, Contrastive Experiment, Recommendations, and Toolkit -

11 years 7 months ago
Tokenization: Returning to a Long Solved Problem - A Survey, Contrastive Experiment, Recommendations, and Toolkit -
We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genreor domain-specific idiosyncrasies).
Rebecca Dridan, Stephan Oepen
Added 29 Sep 2012
Updated 29 Sep 2012
Type Journal
Year 2012
Where ACL
Authors Rebecca Dridan, Stephan Oepen
Comments (0)