We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Tr...
This paper presents an unsupervised approach to learning translation span alignments from parallel data that improves syntactic rule extraction by deleting spurious word alignment...
The language MIX consists of all strings over the three-letter alphabet {a, b, c} that contain an equal number of occurrences of each letter. We prove Joshi’s (1985) conjecture ...
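To make the definition of MIX concrete, here is a minimal membership check in Python. It illustrates the definition only, not the result proved in the paper, and the function name `in_mix` is a hypothetical choice made for this sketch.

```python
from collections import Counter


def in_mix(s: str) -> bool:
    """Return True if s belongs to MIX (hypothetical helper name).

    A string is in MIX when it uses only the letters a, b, c and
    contains each of the three letters the same number of times.
    """
    # Reject any string that leaves the three-letter alphabet.
    if any(ch not in "abc" for ch in s):
        return False
    # Counter returns 0 for letters that never occur, so the empty
    # string (zero of each letter) is accepted, as the definition implies.
    counts = Counter(s)
    return counts["a"] == counts["b"] == counts["c"]
```

For example, `in_mix("cabbac")` holds because each letter occurs twice, while `in_mix("aabc")` does not. Letter order is unconstrained; only the counts matter.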
In this paper we introduce the novel task of “word epoch disambiguation,” defined as the problem of identifying changes in word usage over time. Through experiments run using...
We introduce a spectral learning algorithm for latent-variable PCFGs (Petrov et al., 2006). Under a separability (singular value) condition, we prove that the method provides cons...
Shay B. Cohen, Karl Stratos, Michael Collins, Dean...