A spoken term detection framework for recovering out-of-vocabulary words using the web


Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information-rich terms (often named entities), and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into the system output, recovering up to 40% of OOVs and reducing system error.
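The recovery idea in the abstract can be illustrated with a toy sketch. This is not the authors' system: the candidate list stands in for words that would be retrieved from the Web using the surrounding context, and plain character similarity from Python's `difflib` stands in for the STD search over the OOV region. The names `rank_candidates`, `recover_oov`, the `<OOV>` token, and the threshold value are all hypothetical.

```python
from difflib import SequenceMatcher


def rank_candidates(oov_region, candidates, threshold=0.6):
    """Rank candidate words by similarity to the decoded OOV region.

    Assumption: oov_region is a rough letter rendering of what the recognizer
    heard; a real system would compare phonetic representations instead.
    """
    scored = []
    for word in candidates:
        score = SequenceMatcher(None, oov_region.lower(), word.lower()).ratio()
        if score >= threshold:
            scored.append((word, score))
    # Best-matching candidate first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def recover_oov(transcript, oov_token, oov_region, candidates):
    """Replace the first OOV placeholder with the best candidate, if any."""
    ranked = rank_candidates(oov_region, candidates)
    if not ranked:
        return transcript  # nothing matched well enough; leave the gap
    return transcript.replace(oov_token, ranked[0][0], 1)


# Hypothetical candidates, standing in for Web-derived vocabulary.
candidates = ["Obama", "Alabama", "senate"]
hypothesis = "president <OOV> spoke today"
print(recover_oov(hypothesis, "<OOV>", "obamma", candidates))
```

The threshold keeps poorly matching candidates out of the transcript, so an OOV region with no plausible match is left marked rather than filled with a wrong word.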

Carolina Parada, Abhinav Sethy, Mark Dredze, Frederick Jelinek


Downstream NLP Technologies | INTERSPEECH 2010 | Signal Processing | Spoken Term Detection | Vocabulary Continuous Speech


» A comparison of grapheme and phoneme-based units for Spanish spoken term detection

» Three phase verification for spoken dialog clarification

» Exploiting Speech Recognition Transcripts for Narrative Peak Detection in ShortForm Docume...

» TalkMiner a search engine for online lecture video

» Combining link and content for community detection a discriminative approach

Post Info

Added	18 May 2011
Updated	18 May 2011
Type	Journal
Year	2010
Where	INTERSPEECH
Authors	Carolina Parada, Abhinav Sethy, Mark Dredze, Frederick Jelinek
