Multimodal Phoneme Recognition of Meeting Data

15 years 5 months ago

Download www.fit.vutbr.cz

This paper describes experiments in automatic recognition of context-independent phoneme strings from meeting data using audiovisual features. Visual features are known to improve the accuracy and noise robustness of automatic speech recognizers. However, many problems arise when the data are not “visually clean”, that is, when variation in the speaker’s frontal pose, lighting conditions, background, etc. is not limited. The goal of this work was to test whether visual information can help phoneme recognition using neural nets. While the audio part is fixed and uses standard Mel filter-bank energies, different features describing the video were tested: average brightness, DCT coefficients extracted from a region of interest (ROI), optical flow analysis, and lip-position features. The recognition was evaluated on a subset of IDIAP meeting room data. We observed a small improvement over audio-only recognition, but further work needs to be done especiall...
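To make two of the visual features named in the abstract concrete, the following is a minimal sketch of how average brightness and low-order 2-D DCT coefficients might be computed from a region of interest and appended to an audio feature vector. It is an illustration under stated assumptions, not the authors' implementation: the function names (`dct_matrix`, `roi_features`, `fuse`), the ROI format `(top, left, height, width)`, the number of retained coefficients, and the simple concatenation of audio and visual features are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def dct_matrix(n):
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row uses a different scale factor
    return m


def roi_features(frame, roi, num_coeffs=3):
    """Average brightness plus low-order 2-D DCT coefficients of an ROI.

    frame: 2-D grayscale image array.
    roi: (top, left, height, width), a hypothetical ROI format.
    num_coeffs: side of the low-frequency block kept (assumed value).
    """
    top, left, height, width = roi
    patch = frame[top:top + height, left:left + width].astype(np.float64)
    brightness = patch.mean()
    # Separable 2-D DCT: transform the rows, then the columns.
    coeffs = dct_matrix(height) @ patch @ dct_matrix(width).T
    low = coeffs[:num_coeffs, :num_coeffs].ravel()
    return np.concatenate(([brightness], low))


def fuse(audio_feats, visual_feats):
    """Concatenate per-frame audio and visual features (assumed fusion)."""
    return np.concatenate((audio_feats, visual_feats))
```

In this sketch, a per-frame vector of Mel filter-bank energies would be passed as `audio_feats`, and the fused vector would serve as input to the neural-net phoneme classifier. Keeping only the top-left block of DCT coefficients retains the low spatial frequencies of the ROI, which is a common way to compress an image patch into a short feature vector.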

Petr Motlíček, Jan Černocký
