Multi-Stream Speech Recognition



Within the area of speech recognition, the paramount cause of the discrepancy between the performance of humans and machines is the lack of immunity of Automatic Speech Recognition (ASR) systems to variations in the acoustic signal that do not affect the linguistic message; for example, variations stemming from a change of speaker or speaking style, environmental noise, or channel distortions. Conventional ASR systems rely on a single source of information, in sharp contrast to what we know about the way humans process speech. Both physiological and psycho-acoustic studies have shown that human speech recognition is based on several parallel information extractions from the speech signal. This indicates that incorporating more heterogeneous processing into ASR systems might be a way to escape a possible local performance maximum of current ASR systems.

The multi-stream speech recognition framework differs from the more conventional single-stream ASR approaches in that, instead of basing recognition on a single line of information extraction from a single signal source, multi-stream ASR systems rely on multiple information extraction methods that operate on potentially several signal sources. The underlying principle of the multi-stream paradigm is that extracting and fusing diverse and complementary information may benefit performance, since no optimal, error-free method of extracting reliable information for speech recognition exists.

One of the main questions arising in the multi-stream approach concerns the nature of the feature streams to combine. Nearly all previous multi-stream research has employed features designed predominantly for conventional single-stream systems. Typically, the features chosen are those with the highest performance in isolation, under the assumption that this will also lead to the highest performance when the features are combined.
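As a minimal sketch of the fusion step this paradigm implies (not the project's specific method), the per-frame phoneme posteriors produced by each stream can be merged with a weighted log-linear combination, a common multi-stream fusion rule. The function name `combine_streams` and the equal-weight default are illustrative assumptions.

```python
import numpy as np

def combine_streams(posteriors, weights=None):
    """Fuse per-frame phoneme posteriors from several feature streams.

    posteriors: list of (frames, phonemes) arrays, one per stream.
    weights: optional per-stream reliability weights (default: equal).
    Uses a log-linear (geometric) combination and renormalizes per frame.
    """
    n = len(posteriors)
    if weights is None:
        weights = np.full(n, 1.0 / n)
    # Sum weighted log-posteriors; the epsilon guards against log(0).
    log_p = sum(w * np.log(p + 1e-12) for w, p in zip(weights, posteriors))
    combined = np.exp(log_p)
    return combined / combined.sum(axis=1, keepdims=True)
```

With equal weights this reduces to a normalized geometric mean, so a phoneme that both streams agree on retains a high posterior, while a phoneme supported by only one stream is attenuated.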
However, this assumption is not necessarily valid, and many multi-stream approaches, although often demonstrating good performance, may appear rather ad hoc. The goal of this work is to find a more principled way of choosing the features to combine; specifically, a data-driven approach is developed to tailor heterogeneous feature streams to a multi-stream framework. These feature streams may be unsuitable for a single-stream system, but because they are designed to complement each other, they may perform better in combination than conventional speech recognition features.

One approach has been to introduce more phonetically motivated information into automatic speech recognition in the form of a phonetic `expert'. To avoid the curse of dimensionality, the expert information is introduced at the level of the acoustic model. Two types of experts have been used, each providing discriminative information regarding groups of phonetically related phonemes. The phonetic experts are implemented using a multi-layer perceptron (MLP). Experiments on a numbers recognition task have shown that using the experts in conjunction with both a full-band and a multi-band system increases speech recognition performance.

Another approach focuses on the noise robustness of systems with heterogeneous features, in particular a system where different features are extracted for different sets of phonemes. The employed features are computed by applying a linear transformation, estimated in a data-driven fashion, to standard feature processing methods. The transformed features are tested in a set of experiments employing different system configurations. Overall, the experiments suggest that employing more phoneme-specific features can improve speech recognition. When testing the system on noisy speech with added car or factory noise, this tendency was maintained [Christensen, Lindberg and Andersen, 2001, 2002]. (Heidi Christensen, Børge Lindberg, Ove Andersen)
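To illustrate what a data-driven linear feature transformation can look like, the sketch below estimates a linear discriminant analysis (LDA) projection from labeled feature frames. This is a generic example, not necessarily the transformation estimated in the cited work; the function name `lda_transform` and its interface are assumptions for illustration.

```python
import numpy as np

def lda_transform(features, labels, n_dims):
    """Estimate an LDA projection from labeled frames (a generic,
    data-driven linear transform; illustrative, not the cited method).

    features: (frames, dims) array; labels: per-frame class labels.
    Returns the projected features and the (dims, n_dims) matrix W.
    """
    classes = np.unique(labels)
    d = features.shape[1]
    mean = features.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        X = features[labels == c]
        mc = X.mean(axis=0)
        Sw += (X - mc).T @ (X - mc)
        diff = (mc - mean)[:, None]
        Sb += len(X) * (diff @ diff.T)
    # Directions maximizing between- vs. within-class scatter:
    # eigenvectors of Sw^{-1} Sb, sorted by eigenvalue.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-eigvals.real)
    W = eigvecs.real[:, order[:n_dims]]
    return features @ W, W
```

Estimating one such transform per phoneme group (rather than one global transform) is one way to obtain the phoneme-specific feature streams described above.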
Effective start/end date: 31/12/2003 - 31/12/2003