Many of the documents in large text collections are duplicates and versions of each other. In recent research, we developed new methods for finding such duplicates; however, as the...
Abstract. A large volume of data with complex structures is currently represented in GML (Geography Markup Language) for storing and exchanging geographic information. As the size ...
Increasingly, companies recognize that most of their important information does not exist in relational stores but in documents. For a long time, textual information has been rela...
This paper presents a novel algorithm for document clustering based on a combinatorial framework of the Principal Direction Divisive Partitioning (PDDP) algorithm [1] and a simpli...
Human-quality text summarization systems are di cult to design, and even more di cult to evaluate, in part because documents can di er along several dimensions, such as length, wri...
Jade Goldstein, Mark Kantrowitz, Vibhu O. Mittal, ...