Sciweavers

» Corpora and data preparation
NIPS
2008
14 years 11 months ago
Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization
The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes quest...
Liu Yang, Rong Jin, Rahul Sukthankar
IDEAS
2008
IEEE
15 years 4 months ago
Improved count suffix trees for natural language data
With more and more natural language text stored in databases, handling respective query predicates becomes very important. Optimizing queries with predicates includes (sub)string ...
Guido Sautter, Cristina Abba, Klemens Böhm
HPDC
2010
IEEE
14 years 11 months ago
Reshaping text data for efficient processing on Amazon EC2
Text analysis tools are nowadays required to process increasingly large corpora which are often organized as small files (abstracts, news articles, etc). Cloud computing offers a ...
Gabriela Turcu, Ian T. Foster, Svetlozar Nestorov
CSL
2006
Springer
14 years 10 months ago
A study in machine learning from imbalanced data for sentence boundary detection in speech
Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have const...
Yang Liu, Nitesh V. Chawla, Mary P. Harper, Elizab...
TASLP
2008
14 years 10 months ago
Strategies to Improve the Robustness of Agglomerative Hierarchical Clustering Under Data Source Variation for Speaker Diarization
Many current state-of-the-art speaker diarization systems exploit agglomerative hierarchical clustering (AHC) as their speaker clustering strategy, due to its simple processing str...
K. J. Han, S. Kim, S. S. Narayanan