TY - JOUR
T1 - On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement
AU - Kolbæk, Morten
AU - Tan, Zheng-Hua
AU - Jensen, Jesper
PY - 2019/2
Y1 - 2019/2
AB - The majority of deep neural network (DNN) based speech enhancement algorithms rely on the mean-square error (MSE) criterion of short-time spectral amplitudes (STSA), which has no apparent link to human perception, e.g., speech intelligibility. Short-time objective intelligibility (STOI), a popular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of speech temporal envelopes. This raises the question of whether a DNN training criterion based on envelope linear correlation (ELC) can lead to improved speech intelligibility performance of DNN-based speech enhancement algorithms compared to algorithms based on the STSA-MSE criterion. In this paper, we derive that, under certain general conditions, the STSA-MSE and ELC criteria are practically equivalent, and we provide empirical data to support our theoretical results. Furthermore, our experimental findings suggest that the standard STSA minimum-MSE estimator is near optimal if the objective is to enhance noisy speech in a manner that is optimal with respect to the STOI speech intelligibility estimator.
KW - Speech enhancement
KW - deep neural networks
KW - minimum mean-square error estimator
KW - speech intelligibility
UR - http://www.scopus.com/inward/record.url?scp=85055695562&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2018.2877909
DO - 10.1109/TASLP.2018.2877909
M3 - Journal article
SN - 2329-9290
VL - 27
SP - 283
EP - 295
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
IS - 2
M1 - 8509159
ER -