On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

Publication: Contribution to book/anthology/report/conference proceeding, conference article in proceeding, research, peer-reviewed

6 Citations (Scopus)

Abstract

Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the training target, i.e. the quantity to be estimated, and of the objective function, which quantifies the quality of this estimate, is critical to performance. This work is the first to present an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs equally well in terms of estimated speech quality.

Original language: English
Title: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Number of pages: 5
Publisher: IEEE
Publication date: 17 Apr. 2019
Pages: 8077-8081
Article number: 8682790
ISBN (Print): 978-1-4799-8132-8
ISBN (Electronic): 978-1-4799-8131-1
DOI
Status: Published - 17 Apr. 2019
Event: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - Brighton, United Kingdom
Duration: 12 May 2019 to 17 May 2019
https://2019.ieeeicassp.org/

Conference

Conference: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Country: United Kingdom
City: Brighton
Period: 12/05/2019 to 17/05/2019
Web address
Name: IEEE International Conference on Acoustics, Speech and Signal Processing. Proceedings
ISSN: 1520-6149
