On Using Written Language Training Data for Spoken Language Modeling


We attempted to improve recognition accuracy by reducing the inadequacies of the lexicon and language model. Specifically, we address the following three problems: (1) the best size for the lexicon, (2) conditioning written text for spoken language recognition, and (3) using additional training outside the text distribution. We found that increasing the lexicon from 20,000 words to 40,000 words reduced the percentage of words outside the vocabulary from over 2% to just 0.2%, thereby decreasing the error rate substantially. The error rate on words already in the vocabulary did not increase substantially. We modified the language model training text by applying rules to simulate the differences between the training text and what people actually said. Finally, we found that using another three years of training text, even without the appropriate preprocessing, substantially improved the language model. We also tested these approaches on spontaneous news dictation and found similar improve...
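The out-of-vocabulary (OOV) percentage discussed in the abstract can be illustrated with a minimal sketch. It assumes the lexicon is simply the N most frequent words of the training text, which may differ from how the authors selected theirs; the function names `build_lexicon` and `oov_rate` and the toy sentences are hypothetical and do not come from the paper.

```python
from collections import Counter


def build_lexicon(training_tokens, size):
    """Return the `size` most frequent words of the training text (an assumed selection rule)."""
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(size)}


def oov_rate(test_tokens, lexicon):
    """Fraction of running words in the test text that are missing from the lexicon."""
    if not test_tokens:
        return 0.0
    misses = sum(1 for word in test_tokens if word not in lexicon)
    return misses / len(test_tokens)


# Toy data invented for illustration; not taken from the paper.
train = "the market rose as the bank cut rates and the dollar fell".split()
test = "the bank cut rates as the yen fell".split()

small = build_lexicon(train, 3)   # stands in for the smaller lexicon
large = build_lexicon(train, 10)  # stands in for the larger lexicon

print(f"OOV rate, small lexicon: {oov_rate(test, small):.3f}")
print(f"OOV rate, large lexicon: {oov_rate(test, large):.3f}")
```

On this toy data the larger lexicon leaves only one unseen word ("yen") uncovered, mirroring in miniature the effect the abstract reports when moving from 20,000 to 40,000 words.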

Richard M. Schwartz, Long Nguyen, Francis Kubala, George Chou, George Zavaliagkos, John Makhoul


Error Rate | Language Model | NAACL 1994 | NAACL 2007 | Training Text |


» Content-based language models for spoken document retrieval

» Spoken language translation from parallel speech audio: Simultaneous interpretation as SLT ...

» System Combination for Machine Translation of Spoken and Written Language

» Multi-Class Composite N-gram Language Model for Spoken Language Processing Using Multiple Wo...

» A Concept Model for Computer-based Spoken Language Tests

» Topic and style-adapted language modeling for Thai broadcast news ASR

» Automatic Extraction of Spoken Word in Broadcast Media Language

» Re-Ranking Models Based on Small Training Data for Spoken Language Understanding

Post Info

Added	02 Nov 2010
Updated	02 Nov 2010
Type	Conference
Year	1994
Where	NAACL
Authors	Richard M. Schwartz, Long Nguyen, Francis Kubala, George Chou, George Zavaliagkos, John Makhoul
