On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement

Morten Kolbæk; Zheng-Hua Tan; Jesper Jensen

doi:10.1109/TASLP.2018.2877909

Publication: Contribution to journal › Journal article › Research › peer review

18 Citations (Scopus)

207 Downloads (Pure)

Abstract

The majority of deep neural network (DNN) based speech enhancement algorithms rely on the mean-square error (MSE) criterion of short-time spectral amplitudes (STSA), which has no apparent link to human perception, e.g., speech intelligibility. Short-time objective intelligibility (STOI), a popular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of speech temporal envelopes. This raises the question if a DNN training criterion based on envelope linear correlation (ELC) can lead to improved speech intelligibility performance of DNN-based speech enhancement algorithms compared to algorithms based on the STSA-MSE criterion. In this paper, we derive that, under certain general conditions, the STSA-MSE and ELC criteria are practically equivalent, and we provide empirical data to support our theoretical results. Furthermore, our experimental findings suggest that the standard STSA minimum-MSE estimator is near optimal, if the objective is to enhance noisy speech in a manner, which is optimal with respect to the STOI speech intelligibility estimator.

Original language	English
Article number	8509159
Journal	IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume	27
Issue number	2
Pages (from-to)	283-295
Number of pages	13
ISSN	2329-9290
DOI	https://doi.org/10.1109/TASLP.2018.2877909
Status	Published - Feb. 2019

Access to document

10.1109/TASLP.2018.2877909

Green Open Access article: Accepted manuscript, 4.66 MB

Other files and links

http://www.scopus.com/inward/record.url?scp=85055695562&partnerID=8YFLogxK

Citation formats

@article{b2fc9b8375e54291be70d85bafa758be,

title = "On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement",

abstract = "The majority of deep neural network (DNN) based speech enhancement algorithms rely on the mean-square error (MSE) criterion of short-time spectral amplitudes (STSA), which has no apparent link to human perception, e.g., speech intelligibility. Short-time objective intelligibility (STOI), a popular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of speech temporal envelopes. This raises the question if a DNN training criterion based on envelope linear correlation (ELC) can lead to improved speech intelligibility performance of DNN-based speech enhancement algorithms compared to algorithms based on the STSA-MSE criterion. In this paper, we derive that, under certain general conditions, the STSA-MSE and ELC criteria are practically equivalent, and we provide empirical data to support our theoretical results. Furthermore, our experimental findings suggest that the standard STSA minimum-MSE estimator is near optimal, if the objective is to enhance noisy speech in a manner, which is optimal with respect to the STOI speech intelligibility estimator.",

keywords = "Speech enhancement, deep neural networks, minimum mean-square error estimator, speech intelligibility",

author = "Morten Kolb{\ae}k and Zheng-Hua Tan and Jesper Jensen",

year = "2019",

month = feb,

doi = "10.1109/TASLP.2018.2877909",

language = "English",

volume = "27",

pages = "283--295",

journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",

issn = "2329-9290",

publisher = "IEEE Signal Processing Society",

number = "2",

}

On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement. / Kolbæk, Morten; Tan, Zheng-Hua; Jensen, Jesper.
In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, No. 2, 8509159, 02.2019, p. 283-295.

TY - JOUR

T1 - On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement

AU - Kolbæk, Morten

AU - Tan, Zheng-Hua

AU - Jensen, Jesper

PY - 2019/2

Y1 - 2019/2

N2 - The majority of deep neural network (DNN) based speech enhancement algorithms rely on the mean-square error (MSE) criterion of short-time spectral amplitudes (STSA), which has no apparent link to human perception, e.g., speech intelligibility. Short-time objective intelligibility (STOI), a popular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of speech temporal envelopes. This raises the question if a DNN training criterion based on envelope linear correlation (ELC) can lead to improved speech intelligibility performance of DNN-based speech enhancement algorithms compared to algorithms based on the STSA-MSE criterion. In this paper, we derive that, under certain general conditions, the STSA-MSE and ELC criteria are practically equivalent, and we provide empirical data to support our theoretical results. Furthermore, our experimental findings suggest that the standard STSA minimum-MSE estimator is near optimal, if the objective is to enhance noisy speech in a manner, which is optimal with respect to the STOI speech intelligibility estimator.

AB - The majority of deep neural network (DNN) based speech enhancement algorithms rely on the mean-square error (MSE) criterion of short-time spectral amplitudes (STSA), which has no apparent link to human perception, e.g., speech intelligibility. Short-time objective intelligibility (STOI), a popular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of speech temporal envelopes. This raises the question if a DNN training criterion based on envelope linear correlation (ELC) can lead to improved speech intelligibility performance of DNN-based speech enhancement algorithms compared to algorithms based on the STSA-MSE criterion. In this paper, we derive that, under certain general conditions, the STSA-MSE and ELC criteria are practically equivalent, and we provide empirical data to support our theoretical results. Furthermore, our experimental findings suggest that the standard STSA minimum-MSE estimator is near optimal, if the objective is to enhance noisy speech in a manner, which is optimal with respect to the STOI speech intelligibility estimator.

KW - Speech enhancement

KW - deep neural networks

KW - minimum mean-square error estimator

KW - speech intelligibility

UR - http://www.scopus.com/inward/record.url?scp=85055695562&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2018.2877909

DO - 10.1109/TASLP.2018.2877909

M3 - Journal article

SN - 2329-9290

VL - 27

SP - 283

EP - 295

JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing

JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing

IS - 2

M1 - 8509159

ER -
