The problem of document classification considers categorizing or grouping of various document types. Each document can be represented as a bag of words, which has no straightforw...
Abstract— With XML becoming a standard for data representation and exchange, we can expect to see large scale repositories and warehouses of XML data. In order for users to under...
Incremental hierarchical text document clustering algorithms are important in organizing documents generated from streaming on-line sources, such as, Newswire and Blogs. However, ...
In this paper, we report the development and experiments of IBM Content Harvester (CH), a tool to analyze and recover templates and content from word processor created text docume...
Large collections of documents are commonly created around a database, where a typical database schema may contain hundreds of tables and thousands of columns. We developed a syst...
Carlos Garcia-Alvarado, Carlos Ordonez, Zhibo Chen...