A Semi-Supervised Document Clustering Algorithm Based on EM

Document clustering is a very hard task in Automatic Text Processing, since it requires extracting regular patterns from a document collection without a priori knowledge of the category structure. The task can be difficult even for humans, because many different but valid partitions may exist for the same collection. Moreover, the lack of information about categories makes it difficult to apply effective feature selection techniques to reduce the noise in the representation of texts. Despite these intrinsic difficulties, text clustering is an important task for Web search applications in which huge collections or quite long query result lists must be automatically organized. Semi-supervised clustering lies in between automatic categorization and auto-organization. It is assumed that the supervisor is not required to specify a set of classes, but only to provide a set of texts grouped by the criteria to be used to organize the collection. In this paper we present a novel algorithm fo...
Leonardo Rigutini, Marco Maggini
Added 28 Jun 2010
Updated 28 Jun 2010
Type Conference
Year 2005
Where WEBI