Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection

Zheng-Hua Tan; Børge Lindberg

doi:10.1109/JSTSP.2010.2057192

Department of Electronic Systems

Research output: Contribution to journal › Journal article › Research › peer-review

82 Citations (Scopus)

Abstract

Frame-based speech processing inherently assumes a stationary behavior of speech signals in a short period of time. Over a long time, the characteristics of the signals can change significantly and frames are not equally important, underscoring the need for frame selection. In this paper, we present a low-complexity and effective frame selection approach based on an a posteriori signal-to-noise ratio (SNR) weighted energy distance. The use of an energy distance, instead of, e.g., a standard cepstral distance, makes the approach computationally efficient and enables fine-granularity search, and the use of a posteriori SNR weighting emphasizes the reliable regions in noisy speech signals. It is experimentally found that the approach is able to assign a higher frame rate to fast-changing events such as consonants, a lower frame rate to steady regions like vowels, and no frames to silence, even for very low SNR signals. The resulting variable frame rate analysis method is applied to three speech processing tasks that are essential to natural interaction with intelligent environments. First, it is used for improving speech recognition performance in noisy environments. Secondly, the method is used for scalable source coding schemes in distributed speech recognition, where the target bit rate is met by adjusting the frame rate. Thirdly, it is applied to voice activity detection. Very encouraging results are obtained for all three speech processing tasks.
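The abstract does not give the formulas, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. It assumes a log-energy difference between adjacent dense frames as the energy distance, a floored log-energy ratio as the a posteriori SNR, a noise estimate taken from the first few frames, and an accumulate-and-threshold selection rule. The function names, the parameters `frame_len`, `hop`, `noise_frames` and `scale`, and all default values are hypothetical.

```python
import numpy as np


def frame_log_energy(signal, frame_len, hop):
    """Log energy of densely spaced frames, using a running sum of squares."""
    sq = np.concatenate(([0.0], np.cumsum(np.square(signal))))
    starts = np.arange(0, len(signal) - frame_len + 1, hop)
    energy = sq[starts + frame_len] - sq[starts]
    return starts, np.log(energy + 1e-10)  # small floor avoids log(0)


def select_frames(signal, frame_len=200, hop=8, noise_frames=10, scale=10.0):
    """Return start samples of the selected frames (illustrative sketch).

    Assumptions (not taken from the source): the first `noise_frames` dense
    frames contain noise only, and the threshold is `scale` times the mean
    weighted distance.
    """
    signal = np.asarray(signal, dtype=float)
    starts, log_e = frame_log_energy(signal, frame_len, hop)
    if len(starts) < 2:
        return starts[:0]

    # A posteriori SNR in the log domain, floored at zero.
    log_e_noise = log_e[:noise_frames].mean()
    snr_post = np.maximum(log_e - log_e_noise, 0.0)

    # SNR-weighted energy distance between consecutive dense frames.
    dist = np.concatenate(([0.0], np.abs(np.diff(log_e)))) * snr_post

    threshold = scale * dist.mean()
    if threshold <= 0.0:
        return starts[:0]  # nothing but silence or stationary noise

    # Accumulate the distance; emit a frame each time it crosses the threshold.
    selected = []
    acc = 0.0
    for t, d in enumerate(dist):
        acc += d
        if acc >= threshold:
            selected.append(t)
            acc = 0.0
    return starts[np.array(selected, dtype=int)]
```

In this sketch, regions where the energy changes quickly and the SNR is high reach the threshold often and so receive many frames, while segments at the noise level contribute almost nothing and receive few or none. Raising `scale` lowers the average frame rate, which is one way a target rate could be approached.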

Original language	English
Journal	IEEE Journal of Selected Topics in Signal Processing
Volume	4
Issue number	5
Pages (from-to)	798-807
ISSN	1932-4553
DOIs	https://doi.org/10.1109/JSTSP.2010.2057192
Publication status	Published - Oct 2010

Keywords

Distributed speech recognition
frame selection
voice activity detection
noise-robust speech recognition
variable frame rate

Cite this

@article{4fe578f0d4ed11debb13000ea68e967b,

title = "Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection",

abstract = "Frame-based speech processing inherently assumes a stationary behavior of speech signals in a short period of time. Over a long time, the characteristics of the signals can change significantly and frames are not equally important, underscoring the need for frame selection. In this paper, we present a low-complexity and effective frame selection approach based on an a posteriori signal-to-noise ratio (SNR) weighted energy distance. The use of an energy distance, instead of, e.g., a standard cepstral distance, makes the approach computationally efficient and enables fine-granularity search, and the use of a posteriori SNR weighting emphasizes the reliable regions in noisy speech signals. It is experimentally found that the approach is able to assign a higher frame rate to fast-changing events such as consonants, a lower frame rate to steady regions like vowels, and no frames to silence, even for very low SNR signals. The resulting variable frame rate analysis method is applied to three speech processing tasks that are essential to natural interaction with intelligent environments. First, it is used for improving speech recognition performance in noisy environments. Secondly, the method is used for scalable source coding schemes in distributed speech recognition, where the target bit rate is met by adjusting the frame rate. Thirdly, it is applied to voice activity detection. Very encouraging results are obtained for all three speech processing tasks.",

keywords = "Distributed speech recognition, frame selection, voice activity detection, noise-robust speech recognition, variable frame rate",

author = "Zheng-Hua Tan and B{\o}rge Lindberg",

year = "2010",

month = oct,

doi = "10.1109/JSTSP.2010.2057192",

language = "English",

volume = "4",

pages = "798--807",

journal = "IEEE Journal of Selected Topics in Signal Processing",

issn = "1932-4553",

publisher = "IEEE",

number = "5",

}

TY - JOUR

T1 - Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection

AU - Tan, Zheng-Hua

AU - Lindberg, Børge

PY - 2010/10

Y1 - 2010/10

AB - Frame-based speech processing inherently assumes a stationary behavior of speech signals in a short period of time. Over a long time, the characteristics of the signals can change significantly and frames are not equally important, underscoring the need for frame selection. In this paper, we present a low-complexity and effective frame selection approach based on an a posteriori signal-to-noise ratio (SNR) weighted energy distance. The use of an energy distance, instead of, e.g., a standard cepstral distance, makes the approach computationally efficient and enables fine-granularity search, and the use of a posteriori SNR weighting emphasizes the reliable regions in noisy speech signals. It is experimentally found that the approach is able to assign a higher frame rate to fast-changing events such as consonants, a lower frame rate to steady regions like vowels, and no frames to silence, even for very low SNR signals. The resulting variable frame rate analysis method is applied to three speech processing tasks that are essential to natural interaction with intelligent environments. First, it is used for improving speech recognition performance in noisy environments. Secondly, the method is used for scalable source coding schemes in distributed speech recognition, where the target bit rate is met by adjusting the frame rate. Thirdly, it is applied to voice activity detection. Very encouraging results are obtained for all three speech processing tasks.

KW - Distributed speech recognition

KW - frame selection

KW - voice activity detection

KW - noise-robust speech recognition

KW - variable frame rate

U2 - 10.1109/JSTSP.2010.2057192

DO - 10.1109/JSTSP.2010.2057192

M3 - Journal article

SN - 1932-4553

VL - 4

SP - 798

EP - 807

JO - IEEE Journal of Selected Topics in Signal Processing

JF - IEEE Journal of Selected Topics in Signal Processing

IS - 5

ER -
