Minimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features: A Theoretically Consistent Approach

Research output: Contribution to journal > Journal article > Research > peer-review


Abstract

In this work we consider the problem of feature enhancement for noise-robust automatic speech recognition (ASR). We propose a method for minimum mean-square error (MMSE) estimation of mel-frequency cepstral features, which is based on a minimum number of well-established, theoretically consistent statistical assumptions. More specifically, the method belongs to the class of methods relying on the statistical framework proposed in Ephraim and Malah’s original work [1]. The method is general in that it allows MMSE estimation of mel-frequency cepstral coefficients (MFCCs), cepstral-mean subtracted (CMS-) MFCCs, autoregressive-moving-average (ARMA)-filtered CMS-MFCCs, velocity, and acceleration coefficients. In addition, the method is easily modified to take into account compressive non-linearities other than the logarithm traditionally used for MFCC computation. In terms of MFCC estimation performance, as measured by MFCC mean-square error, the proposed method performs identically to or better than other state-of-the-art methods. In terms of ASR performance, no statistically significant difference could be found between the proposed method and the state-of-the-art methods. We conclude that existing state-of-the-art MFCC feature enhancement algorithms within this class of algorithms, while theoretically suboptimal or based on theoretically inconsistent assumptions, perform close to optimally in the MMSE sense.
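
Why one estimator can cover all of these feature types can be illustrated with a short sketch. MFCCs are a DCT of compressed (usually log) mel-filterbank energies, and CMS, ARMA filtering, and velocity/acceleration coefficients are linear operations on MFCCs, so by linearity of expectation their MMSE estimates follow from applying the same linear maps to E[g(mel energy) | noisy observation], where g is the compressive non-linearity. The Python below is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the Gaussian model of the Ephraim-Malah framework, with clean-speech and noise DFT coefficients independent across bins and frames and with known variances, uses the resulting Wiener posterior for each clean coefficient, and evaluates the expectation by Monte Carlo averaging, a numerical stand-in that is not necessarily how the paper computes it. All function names, mel weights, and toy values are hypothetical.

```python
import math
import random


def wiener_posterior(noisy, speech_var, noise_var):
    """Posterior of one clean DFT coefficient S given noisy Y = S + N, assuming
    S ~ CN(0, speech_var) and N ~ CN(0, noise_var) are independent:
    S | Y ~ CN(G * Y, G * noise_var), Wiener gain G = speech_var / (speech_var + noise_var)."""
    gain = speech_var / (speech_var + noise_var)
    return gain * noisy, gain * noise_var


def mmse_compressed_mel(noisy_bins, speech_vars, noise_vars, mel_weights,
                        compress=math.log, n_draws=2000, seed=0):
    """Monte Carlo estimate of E[compress(M_j) | Y] for every mel band j, where
    M_j = sum_k w_jk |S_k|^2 (bins assumed conditionally independent)."""
    rng = random.Random(seed)
    posts = [wiener_posterior(y, s, n)
             for y, s, n in zip(noisy_bins, speech_vars, noise_vars)]
    totals = [0.0] * len(mel_weights)
    for _ in range(n_draws):
        # Draw every clean coefficient from its posterior; CN(m, v) has
        # independent real/imaginary parts, each with variance v / 2.
        powers = []
        for mean, var in posts:
            sd = math.sqrt(var / 2.0)
            s = complex(rng.gauss(mean.real, sd), rng.gauss(mean.imag, sd))
            powers.append(abs(s) ** 2)
        for j, weights in enumerate(mel_weights):
            mel = sum(w * p for w, p in zip(weights, powers))
            totals[j] += compress(mel)
    return [t / n_draws for t in totals]


def dct2(x, n_ceps):
    """Unnormalised DCT-II: a fixed linear map from log-mel values to cepstra."""
    n = len(x)
    return [sum(x[m] * math.cos(math.pi * k * (m + 0.5) / n) for m in range(n))
            for k in range(n_ceps)]


def cepstral_mean_subtract(frames):
    """CMS: subtract the per-coefficient mean over frames (linear in the frames)."""
    means = [sum(col) / len(frames) for col in zip(*frames)]
    return [[c - m for c, m in zip(frame, means)] for frame in frames]


def deltas(frames, width=2):
    """Regression-style velocity coefficients with edge frames repeated (linear)."""
    last = len(frames) - 1
    denom = 2 * sum(th * th for th in range(1, width + 1))
    out = []
    for t in range(len(frames)):
        acc = [0.0] * len(frames[0])
        for th in range(1, width + 1):
            ahead, behind = frames[min(t + th, last)], frames[max(t - th, 0)]
            acc = [a + th * (f - b) for a, f, b in zip(acc, ahead, behind)]
        out.append([a / denom for a in acc])
    return out


if __name__ == "__main__":
    # Toy numbers (hypothetical): 4 DFT bins, 2 mel bands, 3 frames.
    mel_weights = [[1.0, 0.5, 0.0, 0.0], [0.0, 0.5, 1.0, 0.5]]
    mmse_mfcc = []
    for t in range(3):
        noisy = [complex(1.0 + t, 0.5), complex(0.3, -0.2),
                 complex(-0.8, 0.1), complex(0.2, 0.2)]
        log_mel = mmse_compressed_mel(noisy, [1.0] * 4, [0.5] * 4, mel_weights, seed=t)
        mmse_mfcc.append(dct2(log_mel, n_ceps=2))  # E[DCT(log M)] = DCT(E[log M])
    # CMS and deltas are linear, so they map MMSE MFCCs to MMSE CMS-MFCCs / deltas.
    print(cepstral_mean_subtract(mmse_mfcc))
    print(deltas(mmse_mfcc))
```

Passing, for example, `compress=lambda m: m ** 0.1` swaps the logarithm for a power-law compression (an illustrative choice, not taken from the paper) while the DCT, CMS, and delta steps stay unchanged; only the expectation of the compressed mel energy has to be recomputed.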
Original language: English
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 23
Issue number: 1
Pages (from-to): 186-197
ISSN: 2329-9290
DOI: 10.1109/TASLP.2014.2377591
Publication status: Published - Jan 2015

Fingerprint

Mean square error
Error analysis
coefficients
Speech recognition
Acoustic noise
autoregressive moving average
augmentation
logarithms
nonlinearity

Cite this

@article{384767c881cf4c1e9761aa18aeceaba1,
title = "Minimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features: A Theoretically Consistent Approach",
abstract = "In this work we consider the problem of feature enhancement for noise-robust automatic speech recognition (ASR). We propose a method for minimum mean-square error (MMSE) estimation of mel-frequency cepstral features, which is based on a minimum number of well-established, theoretically consistent statistical assumptions. More specifically, the method belongs to the class of methods relying on the statistical framework proposed in Ephraim and Malah’s original work [1]. The method is general in that it allows MMSE estimation of mel-frequency cepstral coefficients (MFCCs), cepstral-mean subtracted (CMS-) MFCCs, autoregressive-moving-average (ARMA)-filtered CMS-MFCCs, velocity, and acceleration coefficients. In addition, the method is easily modified to take into account compressive non-linearities other than the logarithm traditionally used for MFCC computation. In terms of MFCC estimation performance, as measured by MFCC mean-square error, the proposed method performs identically to or better than other state-of-the-art methods. In terms of ASR performance, no statistically significant difference could be found between the proposed method and the state-of-the-art methods. We conclude that existing state-of-the-art MFCC feature enhancement algorithms within this class of algorithms, while theoretically suboptimal or based on theoretically inconsistent assumptions, perform close to optimally in the MMSE sense.",
author = "Jesper Jensen and Zheng-Hua Tan",
year = "2015",
month = jan,
doi = "10.1109/TASLP.2014.2377591",
language = "English",
volume = "23",
pages = "186--197",
journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
issn = "2329-9290",
publisher = "IEEE Signal Processing Society",
number = "1",

}

Minimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features: A Theoretically Consistent Approach. / Jensen, Jesper; Tan, Zheng-Hua.

In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 1, 01.2015, p. 186-197.

TY  - JOUR
T1  - Minimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features: A Theoretically Consistent Approach
AU  - Jensen, Jesper
AU  - Tan, Zheng-Hua
PY  - 2015/1
Y1  - 2015/1
N2  - In this work we consider the problem of feature enhancement for noise-robust automatic speech recognition (ASR). We propose a method for minimum mean-square error (MMSE) estimation of mel-frequency cepstral features, which is based on a minimum number of well-established, theoretically consistent statistical assumptions. More specifically, the method belongs to the class of methods relying on the statistical framework proposed in Ephraim and Malah’s original work [1]. The method is general in that it allows MMSE estimation of mel-frequency cepstral coefficients (MFCCs), cepstral-mean subtracted (CMS-) MFCCs, autoregressive-moving-average (ARMA)-filtered CMS-MFCCs, velocity, and acceleration coefficients. In addition, the method is easily modified to take into account compressive non-linearities other than the logarithm traditionally used for MFCC computation. In terms of MFCC estimation performance, as measured by MFCC mean-square error, the proposed method performs identically to or better than other state-of-the-art methods. In terms of ASR performance, no statistically significant difference could be found between the proposed method and the state-of-the-art methods. We conclude that existing state-of-the-art MFCC feature enhancement algorithms within this class of algorithms, while theoretically suboptimal or based on theoretically inconsistent assumptions, perform close to optimally in the MMSE sense.
AB  - In this work we consider the problem of feature enhancement for noise-robust automatic speech recognition (ASR). We propose a method for minimum mean-square error (MMSE) estimation of mel-frequency cepstral features, which is based on a minimum number of well-established, theoretically consistent statistical assumptions. More specifically, the method belongs to the class of methods relying on the statistical framework proposed in Ephraim and Malah’s original work [1]. The method is general in that it allows MMSE estimation of mel-frequency cepstral coefficients (MFCCs), cepstral-mean subtracted (CMS-) MFCCs, autoregressive-moving-average (ARMA)-filtered CMS-MFCCs, velocity, and acceleration coefficients. In addition, the method is easily modified to take into account compressive non-linearities other than the logarithm traditionally used for MFCC computation. In terms of MFCC estimation performance, as measured by MFCC mean-square error, the proposed method performs identically to or better than other state-of-the-art methods. In terms of ASR performance, no statistically significant difference could be found between the proposed method and the state-of-the-art methods. We conclude that existing state-of-the-art MFCC feature enhancement algorithms within this class of algorithms, while theoretically suboptimal or based on theoretically inconsistent assumptions, perform close to optimally in the MMSE sense.
U2  - 10.1109/TASLP.2014.2377591
DO  - 10.1109/TASLP.2014.2377591
M3  - Journal article
VL  - 23
SP  - 186
EP  - 197
JO  - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF  - IEEE/ACM Transactions on Audio, Speech, and Language Processing
SN  - 2329-9290
IS  - 1
ER  - 