We present a novel extension to a recently proposed incremental learning algorithm for the word segmentation problem originally introduced in Goldwater (2006). By adding rejuvenat...
In this paper we introduce the novel task of “word epoch disambiguation,” defined as the problem of identifying changes in word usage over time. Through experiments run using...
We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Tr...
The language MIX consists of all strings over the three-letter alphabet {a, b, c} that contain an equal number of occurrences of each letter. We prove Joshi’s (1985) conjecture ...
This paper describes Movie-DiC a Movie Dialogue Corpus recently collected for research and development purposes. The collected dataset comprises 132,229 dialogues containing a tot...