Audio-Visual Speech Inpainting with Deep Learning

Giovanni Morrone, Daniel Michelsanti, Zheng-Hua Tan, Jesper Jensen

Research output: Contribution to book/anthology/report/conference proceeding > Article in proceeding > Research > peer-review

20 Citations (Scopus)

Abstract

In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses on audio-only methods and generally aims at inpainting music signals, whose structure differs markedly from that of speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different durations. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly as gaps grow larger, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.
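
The task setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the speech signal is represented as a sequence of spectrogram frames, that a gap is simulated by zeroing the frames it covers, and that the multi-task objective is a weighted sum of two losses. The function names (`gap_frames`, `mask_gap`, `multitask_loss`), the 10 ms frame hop, and the weight value are all hypothetical choices made for illustration.

```python
def gap_frames(gap_ms: float, hop_ms: float) -> int:
    """Number of spectrogram frames covered by a gap of gap_ms milliseconds.

    hop_ms is the (assumed) time step between consecutive frames.
    """
    return max(1, round(gap_ms / hop_ms))


def mask_gap(frames, start, length):
    """Simulate a missing segment by zeroing `length` frames from `start`.

    Returns (corrupted_frames, mask), where mask[t] is 1 for reliable
    frames and 0 for frames inside the gap.
    """
    end = min(start + length, len(frames))
    mask = [0 if start <= t < end else 1 for t in range(len(frames))]
    corrupted = [[v * m for v in frame] for frame, m in zip(frames, mask)]
    return corrupted, mask


def multitask_loss(inpainting_loss, phone_loss, weight=0.1):
    """Hypothetical multi-task objective: inpainting loss plus a weighted
    phone recognition loss. The weight is an illustrative value only."""
    return inpainting_loss + weight * phone_loss


if __name__ == "__main__":
    hop_ms = 10.0  # assumed frame hop, not taken from the paper
    # Frame counts for the shortest and longest gaps named in the abstract.
    for gap_ms in (100, 1600):
        print(gap_ms, "ms gap spans", gap_frames(gap_ms, hop_ms), "frames")

    # A toy "spectrogram" of 5 frames with 2 frequency bins each.
    frames = [[1.0, 2.0] for _ in range(5)]
    corrupted, mask = mask_gap(frames, start=1, length=2)
    print(mask)
    print(corrupted)
```

In a setup like this, an inpainting model would receive the corrupted frames together with the mask (and, in the audio-visual case, visual features that remain available inside the gap) and would be trained to reconstruct the zeroed frames.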

Original language: English
Title of host publication: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Number of pages: 5
Volume: 2021-June
Publisher: IEEE
Publication date: 2021
Pages: 6653-6657
ISBN (Print): 978-1-7281-7606-2
ISBN (Electronic): 978-1-7281-7605-5
Publication status: Published - 2021
Event: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - Toronto, Canada
Duration: 6 Jun 2021 to 11 Jun 2021

Conference

Conference: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Country/Territory: Canada
City: Toronto
Period: 06/06/2021 to 11/06/2021
Series: IEEE International Conference on Acoustics, Speech and Signal Processing. Proceedings
ISSN: 1520-6149

Keywords

  • Audio-visual
  • Deep learning
  • Face-landmarks
  • Multi-task learning
  • Speech inpainting
