Wednesday, June 07, 2006

MIT CSAIL talk: Head-pose and Illumination Invariant 3-D Audio-Visual Speech Recognition

Speaker: Dimitri Bitouk, Johns Hopkins University
Date: Tuesday, June 6, 2006
Time: 1:00PM to 2:00PM
Location: 32-346
Host: Karen Livescu, CSAIL
Contact: Karen Livescu, 617-253-5953, klivescu@csail.mit.edu

Abstract:

Speech perception is bimodal, employing not only the acoustic signal but also visual cues. Audio-visual speech recognition aims to improve the performance of conventional automatic speech recognition by incorporating visual information. Because they rely on a fundamentally limited two-dimensional representation, current approaches to visual feature extraction lack invariance to the speaker's pose and to illumination in the environment. The research presented in this thesis develops three-dimensional methods for visual feature extraction that alleviate this limitation.

Following the concepts of Grenander's General Pattern Theory, prior knowledge of the speaker's face is described by a prototype consisting of a 3-D surface and a texture. The variability in observed video images of a speaker associated with pose, articulatory facial motion, and illumination is represented by transformations acting on the prototype, which form the group of geometric and photometric variability. Facial motion is described as smooth deformations of the prototype surface and is learned from motion capture data. The effects of illumination are accommodated by analytically constructing surface scalar fields that express relative changes in the face surface irradiance. We derive a multi-resolution tracking algorithm that estimates the speaker's pose, articulatory facial motion, and illumination from uncalibrated monocular video sequences.

The inferred facial motion parameters are used as visual features in audio-visual speech recognition. An application of our approach to large-vocabulary audio-visual speech recognition is presented, in which speaker-independent recognition combines audio and visual models trained at the utterance level. We demonstrate that the visual features derived with our 3-D approach significantly improve recognition performance across a wide range of acoustic signal-to-noise ratios.
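To make the deformation-based model concrete, here is a minimal Python/NumPy sketch of how articulatory motion coefficients could act on a prototype surface and then serve directly as visual features. All function and variable names are hypothetical, and the linear (PCA-style) deformation basis is an assumption for illustration; the abstract does not specify how the smooth deformations learned from motion capture are parameterized.

```python
import numpy as np

def deform_prototype(prototype_vertices, deformation_basis, motion_coeffs):
    """Apply articulatory facial motion to a 3-D prototype surface.

    prototype_vertices : (N, 3) array of prototype vertex positions.
    deformation_basis  : (K, N, 3) array of smooth deformation fields,
                         assumed here to be learned (e.g. by PCA) from
                         motion capture data -- an illustrative choice.
    motion_coeffs      : (K,) array of articulatory motion parameters.
    """
    displacement = np.tensordot(motion_coeffs, deformation_basis, axes=1)
    return prototype_vertices + displacement

def apply_pose(vertices, rotation, translation):
    """Apply the rigid head-pose component (rotation + translation)."""
    return vertices @ rotation.T + translation

# Toy example: a prototype with 4 vertices and a 2-dimensional motion basis.
prototype = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
basis = np.random.default_rng(0).normal(scale=0.01, size=(2, 4, 3))
coeffs = np.array([0.5, -0.3])  # in the real system, inferred by the tracker

deformed = deform_prototype(prototype, basis, coeffs)
posed = apply_pose(deformed, np.eye(3), np.array([0.0, 0.0, 2.0]))

# The motion coefficients themselves (not the 2-D image) would serve as the
# pose- and illumination-invariant visual features passed to the recognizer.
visual_features = coeffs
print(posed.shape, visual_features)
```

In the approach described above, such coefficients would be estimated frame by frame by the multi-resolution tracker from uncalibrated monocular video, jointly with pose and illumination, which is what gives the resulting visual features their invariance.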
