We develop a quantitative method to assess the style of American poems and to visualize a collection of poems in relation to one another. Qualitative poetry criticism helped guide...
This paper addresses personal email filtering by casting it in the framework of text classification. Modeled as semi-structured documents, email messages consist of a set of field...
Extracting entities (such as people, movies) from documents and identifying the categories (such as painter, writer) they belong to enable structured querying and data analysis ov...
The Online Database of Interlinear Text (ODIN) is a database of interlinear text "snippets", harvested mostly from scholarly documents posted to the Web. Although large...
A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal...