An Experimental Study on Light Speech Features for Small-Footprint Keyword Spotting

Ivan Lopez Espejo; Zheng-Hua Tan; Jesper Jensen

doi:10.21437/IberSPEECH.2022-27

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

Abstract

Keyword spotting (KWS) is, in many instances, intended to run on smart electronic devices characterized by limited computational resources. To meet computational constraints, a series of techniques, ranging from feature and acoustic model parameter quantization to the reduction of the number of model parameters and required multiplications, has been explored in the literature. With this same aim, in this paper, we study a straightforward alternative consisting of the reduction of the spectro/cepstro-temporal resolution of the log-Mel and Mel-frequency cepstral coefficient feature matrices commonly employed in KWS. We show that the feature matrix size has a strong impact on the number of multiplications/energy consumption of a state-of-the-art KWS acoustic model based on a convolutional neural network. Experimental results demonstrate that the number of elements in commonly used speech feature matrices can be reduced by a factor of 8 while essentially maintaining KWS performance. Even more interestingly, this size reduction leads to a 9.6× reduction in the number of multiplications/energy consumption, a 4.0× reduction in training time and a 3.7× reduction in inference time.
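
To illustrate why the feature matrix size matters for a convolutional model, the following is a minimal sketch and not the authors' setup. The frame shifts, numbers of Mel bands, kernel size and channel counts are all hypothetical values chosen only to produce an 8× size reduction, and the helper names `feature_shape` and `conv2d_mults` are introduced here for illustration. For a single stride-1, same-padded 2-D convolution, the multiplication count scales linearly with the number of feature matrix elements; the 9.6× figure reported in the abstract refers to the full acoustic model and is not reproduced by this sketch.

```python
def feature_shape(duration_ms, hop_ms, n_bands):
    """Return (frames, bands) of a feature matrix, ignoring edge effects."""
    n_frames = duration_ms // hop_ms
    return n_frames, n_bands


def conv2d_mults(frames, bands, kernel=(3, 3), c_in=1, c_out=32):
    """Multiplications of one stride-1, same-padded 2-D convolution.

    With same padding the output keeps the input's frames x bands size,
    so each output element costs kh * kw * c_in multiplications per filter.
    """
    kh, kw = kernel
    return frames * bands * kh * kw * c_in * c_out


# Hypothetical configurations (assumptions, not taken from the paper):
# 1 s of audio, 10 ms shift and 40 bands versus 40 ms shift and 20 bands.
baseline = feature_shape(1000, 10, 40)
light = feature_shape(1000, 40, 20)

size_ratio = (baseline[0] * baseline[1]) / (light[0] * light[1])
mult_ratio = conv2d_mults(*baseline) / conv2d_mults(*light)

print("baseline shape:", baseline)
print("light shape:", light)
print("feature size reduction:", size_ratio)
print("first-layer multiplication reduction:", mult_ratio)
```

In this simplified single-layer case both ratios coincide. In a complete network, strides, padding and pooling can make the overall saving differ from the raw size reduction.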

Original language	English
Title of host publication	IberSPEECH 2022
Publication date	2022
DOIs	https://doi.org/10.21437/IberSPEECH.2022-27
Publication status	Published - 2022
Event	IberSPEECH 2022, Granada, Spain, 14 Nov 2022 → 16 Nov 2022

Conference

Conference	IberSPEECH 2022
Country/Territory	Spain
City	Granada
Period	14/11/2022 → 16/11/2022

Access to Document

10.21437/IberSPEECH.2022-27

Cite this

@inproceedings{9578dde1bec640b58efdee47458bb2db,
  title = "An Experimental Study on Light Speech Features for Small-Footprint Keyword Spotting",
  abstract = "Keyword spotting (KWS) is, in many instances, intended to run on smart electronic devices characterized by limited computational resources. To meet computational constraints, a series of techniques, ranging from feature and acoustic model parameter quantization to the reduction of the number of model parameters and required multiplications, has been explored in the literature. With this same aim, in this paper, we study a straightforward alternative consisting of the reduction of the spectro/cepstro-temporal resolution of the log-Mel and Mel-frequency cepstral coefficient feature matrices commonly employed in KWS. We show that the feature matrix size has a strong impact on the number of multiplications/energy consumption of a state-of-the-art KWS acoustic model based on a convolutional neural network. Experimental results demonstrate that the number of elements in commonly used speech feature matrices can be reduced by a factor of 8 while essentially maintaining KWS performance. Even more interestingly, this size reduction leads to a 9.6× reduction in the number of multiplications/energy consumption, a 4.0× reduction in training time and a 3.7× reduction in inference time.",
  author = "Espejo, {Ivan Lopez} and Zheng-Hua Tan and Jesper Jensen",
  year = "2022",
  doi = "10.21437/IberSPEECH.2022-27",
  language = "English",
  booktitle = "IberSPEECH 2022",
  note = "IberSPEECH 2022 ; Conference date: 14-11-2022 Through 16-11-2022",
}

TY  - GEN
T1  - An Experimental Study on Light Speech Features for Small-Footprint Keyword Spotting
AU  - Espejo, Ivan Lopez
AU  - Tan, Zheng-Hua
AU  - Jensen, Jesper
PY  - 2022
Y1  - 2022
AB  - Keyword spotting (KWS) is, in many instances, intended to run on smart electronic devices characterized by limited computational resources. To meet computational constraints, a series of techniques, ranging from feature and acoustic model parameter quantization to the reduction of the number of model parameters and required multiplications, has been explored in the literature. With this same aim, in this paper, we study a straightforward alternative consisting of the reduction of the spectro/cepstro-temporal resolution of the log-Mel and Mel-frequency cepstral coefficient feature matrices commonly employed in KWS. We show that the feature matrix size has a strong impact on the number of multiplications/energy consumption of a state-of-the-art KWS acoustic model based on a convolutional neural network. Experimental results demonstrate that the number of elements in commonly used speech feature matrices can be reduced by a factor of 8 while essentially maintaining KWS performance. Even more interestingly, this size reduction leads to a 9.6× reduction in the number of multiplications/energy consumption, a 4.0× reduction in training time and a 3.7× reduction in inference time.
U2  - 10.21437/IberSPEECH.2022-27
DO  - 10.21437/IberSPEECH.2022-27
M3  - Article in proceeding
BT  - IberSPEECH 2022
T2  - IberSPEECH 2022
Y2  - 14 November 2022 through 16 November 2022
ER  -
