Karen Livescu, Spoken Language Systems Group
Spoken language, especially conversational speech, is characterized by a great deal of variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This has been cited as a factor in the poor performance of automatic speech recognizers on conversational speech. One approach to handling this variation consists of expanding the dictionary with phonetic substitution, insertion, and deletion rules. This has the drawbacks that (1) many pronunciation variations typically remain unaccounted for, and (2) word confusability is increased due to the high granularity of phone units.
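As a rough illustration of the dictionary-expansion approach and its confusability drawback, the sketch below applies a few context-independent phone rules to two baseforms. The rule table, the phone labels, and the `expand` helper are hypothetical and chosen only for this example; they are not taken from the work described here.

```python
from itertools import product

# Hypothetical rules: each phone maps to its allowed surface realizations.
# An empty tuple is a deletion; a two-phone tuple is an insertion.
RULES = {
    "t": [("t",), ("dx",), ()],   # keep, flap (substitution), or delete
    "n": [("n",), ("n", "t")],    # keep, or insert a stop after the nasal
}

def expand(baseform):
    """Return every surface pronunciation the rules license for a baseform."""
    options = [RULES.get(phone, [(phone,)]) for phone in baseform]
    variants = set()
    for choice in product(*options):
        # Flatten the per-phone realizations into one phone sequence.
        variants.add(tuple(ph for segment in choice for ph in segment))
    return variants

winter = expand(["w", "ih", "n", "t", "er"])
winner = expand(["w", "ih", "n", "er"])

# Pronunciations licensed for both words can no longer distinguish them.
shared = winter & winner
```

In this toy setup the two expanded entries overlap, which is the kind of added confusability that point (2) refers to, while any variant not covered by a rule remains unaccounted for, as in point (1).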
We present an alternative approach, in which many types of pronunciation variation are explained by representing speech as multiple streams of linguistic features rather than a single stream of phones. Features may correspond to the positions of the speech articulators, such as the lips and tongue, or to more abstract linguistic categories. By allowing for asynchrony between features and per-feature substitutions, many pronunciation changes that are difficult to account for with phone-based models become quite natural. Although it is well known that many phenomena can be attributed to this "semi-independent evolution" of features, previous models of pronunciation variation have typically not taken advantage of it.
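To make the idea of feature asynchrony concrete, here is a minimal sketch in which each feature stream keeps its own position in the word's baseform. The feature set, the target values, the `SURFACE` lookup, and the `realize` function are all assumptions made for illustration, not the model proposed in this work. The example uses the familiar case of "sense" pronounced with an inserted [t].

```python
FEATURES = ("velum", "voicing", "tongue_tip")

# Hypothetical per-phone feature targets.
TARGETS = {
    "s":  {"velum": "closed", "voicing": "voiceless", "tongue_tip": "critical"},
    "eh": {"velum": "closed", "voicing": "voiced",    "tongue_tip": "open"},
    "n":  {"velum": "open",   "voicing": "voiced",    "tongue_tip": "closed"},
}

# Hypothetical mapping from a feature configuration to the phone it sounds like.
SURFACE = {
    ("closed", "voiceless", "critical"): "s",
    ("closed", "voiced", "open"): "eh",
    ("open", "voiced", "closed"): "n",
    ("closed", "voiceless", "closed"): "t",
}

def realize(baseform, alignment, max_async=1):
    """Map per-feature alignments (baseform index per frame) to surface phones."""
    n_frames = len(alignment[FEATURES[0]])
    surface = []
    for frame in range(n_frames):
        positions = [alignment[f][frame] for f in FEATURES]
        # Streams may drift apart, but only by a bounded number of positions.
        if max(positions) - min(positions) > max_async:
            raise ValueError(f"too much asynchrony at frame {frame}")
        config = tuple(TARGETS[baseform[alignment[f][frame]]][f] for f in FEATURES)
        label = SURFACE.get(config, "?")
        # Collapse consecutive frames that sound like the same phone.
        if not surface or surface[-1] != label:
            surface.append(label)
    return surface

sense = ["s", "eh", "n", "s"]

# All features move through the word in lockstep.
synchronous = realize(sense, {f: [0, 1, 2, 3] for f in FEATURES})

# Velum and voicing reach the final [s] one frame before the tongue tip does.
asynchronous = realize(sense, {
    "velum":      [0, 1, 2, 3, 3],
    "voicing":    [0, 1, 2, 3, 3],
    "tongue_tip": [0, 1, 2, 2, 3],
})
```

With the streams in lockstep the output is the dictionary form. When the velum and voicing run one position ahead, the frame with a closed velum, no voicing, and a closed tongue tip is labeled [t], so the insertion falls out of the asynchrony without a phone-level insertion rule.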
In particular, we propose a class of feature-based pronunciation models implemented using dynamic Bayesian networks (DBNs). The DBN approach allows us to naturally represent the factorization of the state space of feature combinations into factors corresponding to different features, as well as providing standard algorithms for inference and parameter learning. We investigate the behavior of such a model in isolation using manually transcribed speech data. These experiments suggest that when compared to a phone-based baseline, a feature-based model has both higher coverage of observed pronunciations and better recognition performance on isolated words excised from a conversational context. We also discuss the ways in which such a model can be incorporated into various types of end-to-end speech recognizers and present several examples of implemented systems, for both acoustic speech recognition and lipreading tasks.