On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

Morten Kolbæk; Zheng-Hua Tan; Søren Holdt Jensen; Jesper Jensen

doi:10.1109/TASLP.2020.2968738

Publication: Contribution to journal › Journal article › Research › peer-reviewed

89 Citations (Scopus)

192 Downloads (Pure)

Abstract

Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, only little is known about the impact of the loss function on the emerging class of time-domain deep learning-based speech enhancement systems. We study how popular loss functions influence the performance of time-domain deep learning-based speech enhancement systems. First, we demonstrate that perceptually inspired loss functions might be advantageous over classical loss functions like MSE. Furthermore, we show that the learning rate is a crucial design parameter even for adaptive gradient-based optimizers, which has been generally overlooked in the literature. Also, we found that waveform matching performance metrics must be used with caution as they in certain situations can fail completely. Finally, we show that a loss function based on scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems.
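The abstract names SI-SDR as a candidate general-purpose loss but gives no formula. The sketch below uses the commonly cited definition, SI-SDR = 10 log10(||αs||² / ||αs − ŝ||²) with α = ŝᵀs / ||s||². That definition is an assumption here and is not taken from this record. The function names, the `eps` guard, and the test signals are all hypothetical and chosen only for illustration.

```python
import math


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB (commonly used definition, assumed here)."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    # Optimal scaling of the target toward the estimate: alpha = <e, t> / <t, t>.
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / (target_energy + eps)
    scaled_target = [alpha * t for t in target]
    # Energy of the scaled target versus energy of the residual.
    signal_energy = sum(s * s for s in scaled_target)
    residual_energy = sum((s - e) ** 2 for s, e in zip(scaled_target, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


def si_sdr_loss(estimate, target):
    """Negated SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr(estimate, target)


def mse(estimate, target):
    """Plain time-domain mean-square error, for comparison."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


# Illustrative signals (hypothetical): a sinusoid plus a weaker, orthogonal one.
n_samples = 200
target = [math.sin(2 * math.pi * 5 * n / n_samples) for n in range(n_samples)]
noise = [0.1 * math.cos(2 * math.pi * 13 * n / n_samples) for n in range(n_samples)]
estimate = [t + v for t, v in zip(target, noise)]
half_gain = [0.5 * e for e in estimate]  # same estimate at half the amplitude

print("SI-SDR, original gain:", si_sdr(estimate, target))
print("SI-SDR, half gain:    ", si_sdr(half_gain, target))
print("MSE, original gain:   ", mse(estimate, target))
print("MSE, half gain:       ", mse(half_gain, target))
```

In this toy case, halving the gain of the estimate leaves SI-SDR unchanged while the time-domain MSE grows considerably. This illustrates only the scale invariance of the measure, not the experimental findings of the paper.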

Original language	English
Article number	8966946
Journal	IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume	28
Pages (from-to)	825-838
Number of pages	14
ISSN	2329-9290
DOI	https://doi.org/10.1109/TASLP.2020.2968738
Status	Published - 23 Jan 2020

Access to document

10.1109/TASLP.2020.2968738

Accepted author manuscript, 3.22 MB

Other files and links

http://www.scopus.com/inward/record.url?scp=85080954873&partnerID=8YFLogxK

On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement
Kolbæk, M. (Creator), VBN, 1 Jan 2020
DOI: 10.5278/257bf91d-bfdf-4414-8e73-e5f4eb9ce69e
Dataset

Citation formats

@article{fcb0de6edb064e1d8b4fa13f91e83208,

title = "On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement",

abstract = "Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, only little is known about the impact of the loss function on the emerging class of time-domain deep learning-based speech enhancement systems. We study how popular loss functions influence the performance of time-domain deep learning-based speech enhancement systems. First, we demonstrate that perceptually inspired loss functions might be advantageous over classical loss functions like MSE. Furthermore, we show that the learning rate is a crucial design parameter even for adaptive gradient-based optimizers, which has been generally overlooked in the literature. Also, we found that waveform matching performance metrics must be used with caution as they in certain situations can fail completely. Finally, we show that a loss function based on scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems.",

keywords = "Speech enhancement, fully convolutional neural networks, objective intelligibility, time-domain",

author = "Morten Kolb{\ae}k and Zheng-Hua Tan and Jensen, {S{\o}ren Holdt} and Jesper Jensen",

year = "2020",

month = jan,

day = "23",

doi = "10.1109/TASLP.2020.2968738",

language = "English",

volume = "28",

pages = "825--838",

journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",

issn = "2329-9290",

publisher = "IEEE Signal Processing Society",

}

TY - JOUR

T1 - On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

AU - Kolbæk, Morten

AU - Tan, Zheng-Hua

AU - Jensen, Søren Holdt

AU - Jensen, Jesper

PY - 2020/1/23

Y1 - 2020/1/23

AB - Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, only little is known about the impact of the loss function on the emerging class of time-domain deep learning-based speech enhancement systems. We study how popular loss functions influence the performance of time-domain deep learning-based speech enhancement systems. First, we demonstrate that perceptually inspired loss functions might be advantageous over classical loss functions like MSE. Furthermore, we show that the learning rate is a crucial design parameter even for adaptive gradient-based optimizers, which has been generally overlooked in the literature. Also, we found that waveform matching performance metrics must be used with caution as they in certain situations can fail completely. Finally, we show that a loss function based on scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems.

KW - Speech enhancement

KW - fully convolutional neural networks

KW - objective intelligibility

KW - time-domain

UR - http://www.scopus.com/inward/record.url?scp=85080954873&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2020.2968738

DO - 10.1109/TASLP.2020.2968738

M3 - Journal article

SN - 2329-9290

VL - 28

SP - 825

EP - 838

JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing

JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing

M1 - 8966946

ER -
