Sciweavers

CSL
2006
Springer

A study in machine learning from imbalanced data for sentence boundary detection in speech

Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed an HMM system to detect sentence boundaries that uses both the prosodic and textual information. In this system, the sentence boundaries are detected by building a classifier in which at each interword boundary, a decision is made as to whether or not it ends a sentence. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora, using both the reference transcription and the recognition output. In the pilot study, w...
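The general idea of combining sampling with bagging for an imbalanced two-class problem can be illustrated with a minimal sketch. It is not the authors' system: the single "pause duration" feature, the toy data, the midpoint-threshold stump standing in for a decision tree, and every function name below are assumptions made purely for illustration. Each ensemble member is trained on a different random downsample of the majority class, and the members' votes are averaged.

```python
import random

def downsample(examples, rng):
    """Randomly drop majority-class examples until both classes are the same size."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))

def train_stump(examples):
    """Hypothetical stand-in for a decision tree: one threshold on one feature,
    placed midway between the two class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def bagged_predict(thresholds, x):
    """Fraction of ensemble members that vote 'sentence boundary' for feature x."""
    return sum(1 for t in thresholds if x > t) / len(thresholds)

# Toy, made-up data: (pause duration in seconds, label); 1 = sentence boundary.
rng = random.Random(0)
data = [(rng.gauss(0.6, 0.1), 1) for _ in range(20)]    # minority class
data += [(rng.gauss(0.1, 0.1), 0) for _ in range(180)]  # majority class

balanced = downsample(data, rng)

# Each ensemble member sees a different balanced subset of the data.
thresholds = [train_stump(downsample(data, rng)) for _ in range(10)]
score_long_pause = bagged_predict(thresholds, 0.9)
score_no_pause = bagged_predict(thresholds, -0.5)
```

Downsampling discards majority-class examples, so any single balanced subset wastes data; training several members on different subsets is one way to use more of the majority class while keeping each training set balanced.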
Added 11 Dec 2010
Updated 11 Dec 2010
Type Journal
Year 2006
Where CSL
Authors Yang Liu, Nitesh V. Chawla, Mary P. Harper, Elizabeth Shriberg, Andreas Stolcke