On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

Daniel Michelsanti; Zheng-Hua Tan; Sigurdur Sigurdsson; Jesper Jensen

doi:10.1109/ICASSP.2019.8682790

Publication: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

19 Citations (Scopus)

Abstract

Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and the objective function, which quantifies the quality of this estimate, to be used for training is critical for the performance. This work is the first that presents an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs equally well in terms of estimated speech quality.

Original language	English
Title	ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Number of pages	5
Publisher	IEEE
Publication date	17 Apr 2019
Pages	8077-8081
Article number	8682790
ISBN (Print)	978-1-4799-8132-8
ISBN (Electronic)	978-1-4799-8131-1
DOI	https://doi.org/10.1109/ICASSP.2019.8682790
Status	Published - 17 Apr 2019
Event	2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - Brighton, United Kingdom; Duration: 12 May 2019 → 17 May 2019; https://2019.ieeeicassp.org/

Conference

Conference	2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Country/Territory	United Kingdom
City	Brighton
Period	12/05/2019 → 17/05/2019
Internet address	https://2019.ieeeicassp.org/

Name	I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings
ISSN	1520-6149

Access to Document

10.1109/ICASSP.2019.8682790

https://arxiv.org/pdf/1811.06234.pdf

Other files and links

http://www.scopus.com/inward/record.url?scp=85068964525&partnerID=8YFLogxK

Citation formats

Michelsanti, Daniel ; Tan, Zheng-Hua ; Sigurdsson, Sigurdur et al. / On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019. p. 8077-8081 (I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings).

@inproceedings{6b38ad10b4814001ac61b1f1ba460274,
  title = "On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement",
  abstract = "Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and the objective function, which quantifies the quality of this estimate, to be used for training is critical for the performance. This work is the first that presents an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs as good in terms of estimated speech quality.",
  keywords = "Audio-visual speech enhancement, deep learning, objective functions, training targets",
  author = "Daniel Michelsanti and Zheng-Hua Tan and Sigurdur Sigurdsson and Jesper Jensen",
  year = "2019",
  month = apr,
  day = "17",
  doi = "10.1109/ICASSP.2019.8682790",
  language = "English",
  isbn = "978-1-4799-8132-8",
  series = "I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings",
  publisher = "IEEE",
  pages = "8077--8081",
  booktitle = "ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
  address = "United States",
  note = "2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), ICASSP ; Conference date: 12-05-2019 Through 17-05-2019",
  url = "https://2019.ieeeicassp.org/",
}

Michelsanti, D, Tan, Z-H, Sigurdsson, S & Jensen, J 2019, On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement. in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8682790, IEEE, I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings, pp. 8077-8081, 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 12/05/2019. https://doi.org/10.1109/ICASSP.2019.8682790

On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement. / Michelsanti, Daniel; Tan, Zheng-Hua; Sigurdsson, Sigurdur et al.
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019. p. 8077-8081 8682790 (I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings).

TY  - GEN
T1  - On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement
AU  - Michelsanti, Daniel
AU  - Tan, Zheng-Hua
AU  - Sigurdsson, Sigurdur
AU  - Jensen, Jesper
PY  - 2019/4/17
Y1  - 2019/4/17
N2  - Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and the objective function, which quantifies the quality of this estimate, to be used for training is critical for the performance. This work is the first that presents an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs as good in terms of estimated speech quality.
AB  - Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and the objective function, which quantifies the quality of this estimate, to be used for training is critical for the performance. This work is the first that presents an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs as good in terms of estimated speech quality.
KW  - Audio-visual speech enhancement
KW  - deep learning
KW  - objective functions
KW  - training targets
UR  - http://www.scopus.com/inward/record.url?scp=85068964525&partnerID=8YFLogxK
U2  - 10.1109/ICASSP.2019.8682790
DO  - 10.1109/ICASSP.2019.8682790
M3  - Article in proceeding
SN  - 978-1-4799-8132-8
T3  - I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings
SP  - 8077
EP  - 8081
BT  - ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PB  - IEEE
T2  - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Y2  - 12 May 2019 through 17 May 2019
ER  -

Michelsanti D, Tan Z-H, Sigurdsson S, Jensen J. On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019. p. 8077-8081. 8682790. (I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings). doi: 10.1109/ICASSP.2019.8682790