TY - JOUR
T1 - Speech Intelligibility Prediction using Spectro-Temporal Modulation Analysis
AU - Edraki, Amin
AU - Chan, Wai Yip Geoffrey
AU - Jensen, Jesper
AU - Fogerty, Daniel
N1 - Publisher Copyright:
IEEE
Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2020/12
Y1 - 2020/12
N2 - Spectro-temporal modulations are believed to mediate the analysis of speech sounds in the human primary auditory cortex. Inspired by humans' robustness in comprehending speech in challenging acoustic environments, we propose an intrusive speech intelligibility prediction (SIP) algorithm, wSTMI, for normal-hearing listeners based on spectro-temporal modulation analysis (STMA) of the clean and degraded speech signals. In the STMA, each of 55 modulation frequency channels contributes an intermediate intelligibility measure. A sparse linear model with parameters optimized using Lasso regression results in combining the intermediate measures of 8 of the most salient channels for SIP. In comparison with a suite of 10 SIP algorithms, wSTMI performs consistently well across 13 datasets, which together cover degradation conditions including modulated noise, noise reduction processing, reverberation, near-end listening enhancement, and speech interruption. We show that the optimized parameters of wSTMI may be interpreted in terms of modulation transfer functions of the human auditory system. Thus, the proposed approach offers evidence affirming previous studies of the perceptual characteristics underlying speech signal intelligibility.
AB - Spectro-temporal modulations are believed to mediate the analysis of speech sounds in the human primary auditory cortex. Inspired by humans' robustness in comprehending speech in challenging acoustic environments, we propose an intrusive speech intelligibility prediction (SIP) algorithm, wSTMI, for normal-hearing listeners based on spectro-temporal modulation analysis (STMA) of the clean and degraded speech signals. In the STMA, each of 55 modulation frequency channels contributes an intermediate intelligibility measure. A sparse linear model with parameters optimized using Lasso regression results in combining the intermediate measures of 8 of the most salient channels for SIP. In comparison with a suite of 10 SIP algorithms, wSTMI performs consistently well across 13 datasets, which together cover degradation conditions including modulated noise, noise reduction processing, reverberation, near-end listening enhancement, and speech interruption. We show that the optimized parameters of wSTMI may be interpreted in terms of modulation transfer functions of the human auditory system. Thus, the proposed approach offers evidence affirming previous studies of the perceptual characteristics underlying speech signal intelligibility.
KW - spectro-temporal modulation
KW - speech intelligibility
KW - speech quality model
UR - http://www.scopus.com/inward/record.url?scp=85097126929&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2020.3039929
DO - 10.1109/TASLP.2020.3039929
M3 - Journal article
AN - SCOPUS:85097126929
SN - 2329-9290
VL - 29
SP - 210
EP - 225
JO - IEEE/ACM Transactions on Audio Speech and Language Processing
JF - IEEE/ACM Transactions on Audio Speech and Language Processing
M1 - 9269417
ER -