On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen

Research output: Contribution to book/anthology/report/conference proceeding; Article in proceeding; Research; peer-review

19 Citations (Scopus)

Abstract

Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and of the objective function, which quantifies the quality of this estimate, is critical for performance. This work is the first to present an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs equally well in terms of estimated speech quality.
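The distinction between a training target and an objective function can be illustrated with a minimal sketch. It is not the authors' implementation: the STFT settings, the clipping value `max_value`, the synthetic signals, and all function names are assumptions chosen for illustration, and the visual stream and the neural network itself are left out. It only shows how a mask target and a log magnitude spectrum target could be computed from a clean and noisy pair, with a mean squared error as one possible objective.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=512, hop=256):
    """Magnitude of a Hann-windowed STFT, shape (frames, frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def amplitude_mask_target(clean_mag, noisy_mag, eps=1e-8, max_value=10.0):
    """Mask target: ratio of clean to noisy magnitude, clipped to [0, max_value].
    The clipping value is an assumption, not taken from the paper."""
    return np.clip(clean_mag / (noisy_mag + eps), 0.0, max_value)

def log_magnitude_target(clean_mag, eps=1e-8):
    """Direct target: log magnitude spectrum of the clean speech."""
    return np.log(clean_mag + eps)

def mse(estimate, target):
    """Mean squared error, one possible objective function."""
    return float(np.mean((estimate - target) ** 2))

# Synthetic stand-ins for clean speech and its noisy observation (assumed data).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.3 * rng.standard_normal(t.shape)

clean_mag = magnitude_spectrogram(clean)
noisy_mag = magnitude_spectrogram(noisy)

mask = amplitude_mask_target(clean_mag, noisy_mag)   # target for mask estimation
log_mag = log_magnitude_target(clean_mag)            # target for direct estimation

# With a mask target, the enhanced magnitude is the mask applied to the noisy input.
enhanced_mag = mask * noisy_mag
```

In a trained system the network output would take the place of `mask` or `log_mag`, and the objective would compare that output (or the spectrum reconstructed from it) with the corresponding target.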

Original language: English
Title of host publication: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Number of pages: 5
Publisher: IEEE
Publication date: 17 Apr 2019
Pages: 8077-8081
Article number: 8682790
ISBN (Print): 978-1-4799-8132-8
ISBN (Electronic): 978-1-4799-8131-1
Publication status: Published - 17 Apr 2019
Event: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - Brighton, United Kingdom
Duration: 12 May 2019 to 17 May 2019
https://2019.ieeeicassp.org/

Conference

Conference: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Country/Territory: United Kingdom
City: Brighton
Period: 12/05/2019 to 17/05/2019
Series: I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings
ISSN: 1520-6149

Keywords

  • Audio-visual speech enhancement
  • deep learning
  • objective functions
  • training targets
