Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure

Morten Kolbæk; Zheng-Hua Tan; Jesper Jensen

doi:10.1109/ICASSP.2018.8462040

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

51 Citations (Scopus)

Abstract

In this paper we propose a Deep Neural Network (DNN) based Speech Enhancement (SE) system that is designed to maximize an approximation of the Short-Time Objective Intelligibility (STOI) measure. We formalize an approximate-STOI cost function, derive analytical expressions for the gradients required for DNN training, and show that these gradients have desirable properties when used together with gradient-based optimization techniques. We show through simulation experiments that the proposed SE system achieves large improvements in estimated speech intelligibility when tested on matched and unmatched natural noise types at multiple signal-to-noise ratios. Furthermore, we show that the SE system, when trained using an approximate-STOI cost function, performs on par with a system trained with a mean square error cost applied to short-time temporal envelopes. Finally, we show that the proposed SE system performs on par with a traditional DNN-based Short-Time Spectral Amplitude (STSA) SE system in terms of estimated speech intelligibility. These results are important because they suggest that traditional DNN-based STSA SE systems might be optimal in terms of estimated speech intelligibility.
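
To make the idea of an approximate-STOI cost more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the approximation reduces to the linear correlation between clean and enhanced short-time temporal envelope segments, averaged over frequency bands and segments, as in the intermediate step of the STOI measure. The function names, the array layout (bands x frames), the segment length of 30 frames, and the small constant eps are illustrative assumptions that do not come from this record.

```python
import numpy as np


def envelope_correlation(clean_seg, enhanced_seg, eps=1e-12):
    """Linear (Pearson) correlation between two envelope segments.

    Both inputs are 1-D arrays of equal length. eps is an assumed
    safeguard against division by zero for constant segments.
    """
    a = clean_seg - clean_seg.mean()
    x = enhanced_seg - enhanced_seg.mean()
    return float(a @ x / (np.linalg.norm(a) * np.linalg.norm(x) + eps))


def approx_stoi_score(clean_env, enhanced_env, seg_len=30):
    """Average envelope correlation over all bands and sliding segments.

    clean_env and enhanced_env are hypothetical (bands x frames) arrays of
    short-time temporal envelopes. seg_len=30 is an assumed segment length.
    """
    if clean_env.shape != enhanced_env.shape:
        raise ValueError("envelope arrays must have the same shape")
    n_bands, n_frames = clean_env.shape
    if n_frames < seg_len:
        raise ValueError("need at least seg_len frames")
    scores = []
    for band in range(n_bands):
        for end in range(seg_len, n_frames + 1):
            a = clean_env[band, end - seg_len:end]
            x = enhanced_env[band, end - seg_len:end]
            scores.append(envelope_correlation(a, x))
    return float(np.mean(scores))


def approx_stoi_cost(clean_env, enhanced_env, seg_len=30):
    """Negated score, so that minimizing the cost maximizes the score."""
    return -approx_stoi_score(clean_env, enhanced_env, seg_len)
```

Because the score is a correlation, it is insensitive to a constant scaling of the enhanced envelope, which is one way it differs from a mean square error cost on the same envelopes. For DNN training, the same computation would be written in an automatic-differentiation framework or paired with analytical gradients such as those the abstract mentions.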

Original language	English
Title of host publication	2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
Number of pages	5
Publisher	IEEE
Publication date	2018
Pages	5059-5063
Article number	8462040
ISBN (Print)	9781538646588
ISBN (Electronic)	978-1-5386-4658-8
DOIs	https://doi.org/10.1109/ICASSP.2018.8462040
Publication status	Published - 2018
Event	2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - Calgary, Canada Duration: 15 Apr 2018 → 20 Apr 2018 https://2018.ieeeicassp.org/

Conference

Conference	2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Country/Territory	Canada
City	Calgary
Period	15/04/2018 → 20/04/2018
Internet address	https://2018.ieeeicassp.org/

Series	I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings
ISSN	1520-6149

Keywords

Deep Learning
Deep Neural Networks
Speech Denoising
Speech Enhancement
Speech Intelligibility

Access to Document

10.1109/ICASSP.2018.8462040

https://arxiv.org/pdf/1802.00604.pdf

Cite this

Kolbæk, Morten ; Tan, Zheng-Hua ; Jensen, Jesper. / Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure. 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings. IEEE, 2018. pp. 5059-5063 (I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings).

@inproceedings{dc072a35848745f19db297ffadb53d4c,

title = "Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure",

keywords = "Deep Learning, Deep Neural Networks, Speech Denoising, Speech Enhancement, Speech Intelligibility",

author = "Morten Kolb{\ae}k and Zheng-Hua Tan and Jesper Jensen",

year = "2018",

doi = "10.1109/ICASSP.2018.8462040",

language = "English",

isbn = "9781538646588",

series = "I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings",

publisher = "IEEE",

pages = "5059--5063",

booktitle = "2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings",

address = "United States",

note = "2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), ICASSP ; Conference date: 15-04-2018 Through 20-04-2018",

url = "https://2018.ieeeicassp.org/",

}

TY - GEN

T1 - Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure

AU - Kolbæk, Morten

AU - Tan, Zheng-Hua

AU - Jensen, Jesper

PY - 2018

Y1 - 2018

KW - Deep Learning

KW - Deep Neural Networks

KW - Speech Denoising

KW - Speech Enhancement

KW - Speech Intelligibility

UR - http://www.scopus.com/inward/record.url?scp=85054206412&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2018.8462040

DO - 10.1109/ICASSP.2018.8462040

M3 - Article in proceeding

SN - 9781538646588

T3 - I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings

SP - 5059

EP - 5063

BT - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings

PB - IEEE

T2 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Y2 - 15 April 2018 through 20 April 2018

ER -
