For certain applications, keyword spotting (KWS) requires some degree of personalization. This is the case for KWS in hearing assistive devices, e.g., hearing aids, where only the device user should be allowed to trigger the KWS system. In this paper, we first develop a new, realistic hearing aid experimental framework. Next, using this framework, we show that the performance of a state-of-the-art multi-task deep learning architecture exploiting cepstral features for joint KWS and own-voice/external speaker detection drops significantly. To overcome this problem, we use phase-difference information through GCC-PHAT (Generalized Cross-Correlation with PHAse Transform)-based coefficients along with log-spectral magnitude features. In addition, we demonstrate that working in the perceptually-motivated constant-Q transform (CQT) domain instead of the short-time Fourier transform (STFT) domain yields compact and coherent features that provide superior KWS performance. Our experimental results show that our CQT-based proposal achieves a relative KWS accuracy improvement of around 18% compared to using cepstral features, while dramatically decreasing the number of multiplications in the multi-task architecture, which is key in the context of low-resource devices such as hearing assistive devices.
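As background on the phase-difference features mentioned above, GCC-PHAT normalizes the cross-power spectrum of two microphone channels to unit magnitude, so that only phase (i.e., inter-channel delay) information remains. The following is a minimal illustrative sketch of this generic technique in NumPy, not the paper's actual feature-extraction pipeline; the function name and parameters are our own:

```python
import numpy as np

def gcc_phat(x, y, n_fft=512):
    """Generalized Cross-Correlation with PHAse Transform (sketch).

    The cross-power spectrum of the two channels is normalized to
    unit magnitude, discarding amplitude and keeping only the
    phase-difference information between the channels.
    """
    X = np.fft.rfft(x, n=n_fft)
    Y = np.fft.rfft(y, n=n_fft)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12  # PHAse Transform: whiten to phase only
    return np.fft.irfft(cross, n=n_fft)

# A pure circular delay between the channels produces a correlation
# peak at exactly that lag.
rng = np.random.default_rng(0)
sig = rng.standard_normal(512)
delayed = np.roll(sig, 5)       # second channel lags by 5 samples
cc = gcc_phat(delayed, sig)
lag = int(np.argmax(cc))        # → 5
```

In a two-microphone hearing-aid setup, such phase-based coefficients carry spatial cues that help distinguish the user's own voice from external speakers, which is the role GCC-PHAT-based features play alongside log-spectral magnitude features in the paper.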
|Journal||IEEE/ACM Transactions on Audio, Speech, and Language Processing|
|Status||Published - Apr. 2020|