Page segmentation algorithms found in published literatures often rely on some predetermined parameters such as general font sizes, distances between text lines and document scan ...
The paper describes the IBM systems submitted to the NIST Rich Transcription 2007 (RT07) evaluation campaign for the speechto-text (STT) and speaker-attributed speech-to-text (SAST...
An unsupervised classification algorithm is derived by modeling observed data as a mixture of several mutually exclusive classes that are each described by linear combinations of i...
Most text mining methods are based on representing documents using a vector space model, commonly known as a bag of word model, where each document is modeled as a linear vector r...
Rowena Chau, Ah Chung Tsoi, Markus Hagenbuchner, V...