Improved External Speaker-Robust Keyword Spotting for Hearing Assistive Devices

Ivan Lopez-Espejo; Zheng-Hua Tan; Jesper Jensen

doi:10.1109/TASLP.2020.2984089

Research output: Contribution to journal › Journal article › Research › peer-review

12 Citations (Scopus)

95 Downloads (Pure)

Abstract

For certain applications, keyword spotting (KWS) requires some degree of personalization. This is the case for KWS for hearing assistive devices, e.g., hearing aids, where only the device user should be allowed to trigger the KWS system. In this paper, we first develop a new realistic hearing aid experimental framework. Next, using this framework we show that the performance of a state-of-the-art multi-task deep learning architecture exploiting cepstral features for joint KWS and users' own-voice/external speaker detection drops significantly. To overcome this problem, we use phase difference information through GCC-PHAT (Generalized Cross-Correlation with PHAse Transform)-based coefficients along with log-spectral magnitude features. In addition, we demonstrate that working in the perceptually-motivated constant-Q transform (CQT) domain instead of in the short-time Fourier transform (STFT) domain allows for the generation of compact and coherent features which provide superior KWS performance. Our experimental results show that our CQT-based proposal achieves a relative KWS accuracy improvement of around 18% compared to using cepstral features while dramatically decreasing the number of multiplications in the multi-task architecture, which is key in the context of low-resource devices like hearing assistive devices.

Original language	English
Article number	9054977
Journal	IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume	28
Pages (from-to)	1233-1247
Number of pages	15
ISSN	2329-9290
DOIs	https://doi.org/10.1109/TASLP.2020.2984089
Publication status	Published - Apr 2020

Keywords

Constant-Q transform
External speaker
Generalized cross-correlation
Hearing assistive device
Multi-task learning
Robust keyword spotting

Cite this

@article{059e385bd3b045a2a20040b206bc9492,
  title = "Improved External Speaker-Robust Keyword Spotting for Hearing Assistive Devices",
  abstract = "For certain applications, keyword spotting (KWS) requires some degree of personalization. This is the case for KWS for hearing assistive devices, e.g., hearing aids, where only the device user should be allowed to trigger the KWS system. In this paper, we first develop a new realistic hearing aid experimental framework. Next, using this framework we show that the performance of a state-of-the-art multi-task deep learning architecture exploiting cepstral features for joint KWS and users' own-voice/external speaker detection drops significantly. To overcome this problem, we use phase difference information through GCC-PHAT (Generalized Cross-Correlation with PHAse Transform)-based coefficients along with log-spectral magnitude features. In addition, we demonstrate that working in the perceptually-motivated constant-Q transform (CQT) domain instead of in the short-time Fourier transform (STFT) domain allows for the generation of compact and coherent features which provide superior KWS performance. Our experimental results show that our CQT-based proposal achieves a relative KWS accuracy improvement of around 18\% compared to using cepstral features while dramatically decreasing the number of multiplications in the multi-task architecture, which is key in the context of low-resource devices like hearing assistive devices.",
  keywords = "Constant-Q transform, External speaker, Generalized cross-correlation, Hearing assistive device, Multi-task learning, Robust keyword spotting",
  author = "Ivan Lopez-Espejo and Zheng-Hua Tan and Jesper Jensen",
  year = "2020",
  month = apr,
  doi = "10.1109/TASLP.2020.2984089",
  language = "English",
  volume = "28",
  pages = "1233--1247",
  journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
  issn = "2329-9290",
  publisher = "IEEE Signal Processing Society",
}

TY  - JOUR
T1  - Improved External Speaker-Robust Keyword Spotting for Hearing Assistive Devices
AU  - Lopez-Espejo, Ivan
AU  - Tan, Zheng-Hua
AU  - Jensen, Jesper
PY  - 2020/4
Y1  - 2020/4
N2  - For certain applications, keyword spotting (KWS) requires some degree of personalization. This is the case for KWS for hearing assistive devices, e.g., hearing aids, where only the device user should be allowed to trigger the KWS system. In this paper, we first develop a new realistic hearing aid experimental framework. Next, using this framework we show that the performance of a state-of-the-art multi-task deep learning architecture exploiting cepstral features for joint KWS and users' own-voice/external speaker detection drops significantly. To overcome this problem, we use phase difference information through GCC-PHAT (Generalized Cross-Correlation with PHAse Transform)-based coefficients along with log-spectral magnitude features. In addition, we demonstrate that working in the perceptually-motivated constant-Q transform (CQT) domain instead of in the short-time Fourier transform (STFT) domain allows for the generation of compact and coherent features which provide superior KWS performance. Our experimental results show that our CQT-based proposal achieves a relative KWS accuracy improvement of around 18% compared to using cepstral features while dramatically decreasing the number of multiplications in the multi-task architecture, which is key in the context of low-resource devices like hearing assistive devices.
AB  - For certain applications, keyword spotting (KWS) requires some degree of personalization. This is the case for KWS for hearing assistive devices, e.g., hearing aids, where only the device user should be allowed to trigger the KWS system. In this paper, we first develop a new realistic hearing aid experimental framework. Next, using this framework we show that the performance of a state-of-the-art multi-task deep learning architecture exploiting cepstral features for joint KWS and users' own-voice/external speaker detection drops significantly. To overcome this problem, we use phase difference information through GCC-PHAT (Generalized Cross-Correlation with PHAse Transform)-based coefficients along with log-spectral magnitude features. In addition, we demonstrate that working in the perceptually-motivated constant-Q transform (CQT) domain instead of in the short-time Fourier transform (STFT) domain allows for the generation of compact and coherent features which provide superior KWS performance. Our experimental results show that our CQT-based proposal achieves a relative KWS accuracy improvement of around 18% compared to using cepstral features while dramatically decreasing the number of multiplications in the multi-task architecture, which is key in the context of low-resource devices like hearing assistive devices.
KW  - Constant-Q transform
KW  - External speaker
KW  - Generalized cross-correlation
KW  - Hearing assistive device
KW  - Multi-task learning
KW  - Robust keyword spotting
UR  - http://www.scopus.com/inward/record.url?scp=85084399634&partnerID=8YFLogxK
U2  - 10.1109/TASLP.2020.2984089
DO  - 10.1109/TASLP.2020.2984089
M3  - Journal article
SN  - 2329-9290
VL  - 28
SP  - 1233
EP  - 1247
JO  - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF  - IEEE/ACM Transactions on Audio, Speech, and Language Processing
M1  - 9054977
ER  -
