Speech inpainting: Context-based speech synthesis guided by video

Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng Hua Tan, Jesper Jensen

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

2 Citations (Scopus)

Abstract

Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, i.e., the task of synthesizing the speech in a corrupted audio segment in a way that is consistent with the corresponding visual content and the uncorrupted audio context. We present an audio-visual transformer-based deep learning model that leverages visual cues providing information about the content of the corrupted audio. It outperforms the previous state-of-the-art audio-visual model and audio-only baselines. We also show that visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
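
To make the setup concrete, below is a minimal PyTorch sketch of transformer-based audio-visual inpainting of a masked spectrogram segment. It is an illustrative assumption, not the paper's architecture: the AVSpeechInpainter name, the layer sizes, the additive fusion of audio and visual embeddings, the learned mask token, and the 768-dimensional visual features (e.g., AV-HuBERT hidden states) are all hypothetical choices.

# Hypothetical sketch of audio-visual speech inpainting with a transformer.
# Visual features are assumed to be precomputed and time-aligned with the
# mel-spectrogram frames (e.g., by a lip-reading encoder such as AV-HuBERT).
import torch
import torch.nn as nn


class AVSpeechInpainter(nn.Module):
    def __init__(self, n_mels=80, d_visual=768, d_model=256,
                 n_heads=4, n_layers=6, max_len=4096):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)     # embed mel frames
        self.visual_proj = nn.Linear(d_visual, d_model)  # embed visual features
        self.mask_token = nn.Parameter(torch.zeros(d_model))  # fills the gap
        self.pos = nn.Parameter(torch.randn(1, max_len, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_mels)           # predict mel frames

    def forward(self, mel, visual, corrupt_mask):
        # mel:          (B, T, n_mels) log-mel spectrogram with a corrupted gap
        # visual:       (B, T, d_visual) per-frame visual features
        # corrupt_mask: (B, T) bool, True where audio is missing
        a = self.audio_proj(mel)
        a = torch.where(corrupt_mask.unsqueeze(-1),
                        self.mask_token.expand_as(a), a)  # hide corrupted audio
        x = a + self.visual_proj(visual) + self.pos[:, :mel.size(1)]
        return self.head(self.encoder(x))  # reconstructed mel spectrogram


# Toy usage: reconstruct a 40-frame gap in a 200-frame utterance.
model = AVSpeechInpainter()
mel = torch.randn(2, 200, 80)
visual = torch.randn(2, 200, 768)       # e.g., AV-HuBERT hidden states
mask = torch.zeros(2, 200, dtype=torch.bool)
mask[:, 80:120] = True                  # the corrupted segment
out = model(mel, visual, mask)
loss = ((out - mel)[mask]).pow(2).mean()  # supervise only the gap (one option)

Computing the reconstruction loss only on the corrupted frames, as in the toy example, is one option; supervising the full spectrogram is another common choice. Either way, a vocoder would be needed to turn the predicted mel frames back into a waveform.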

Original language: English
Title of host publication: Proc. INTERSPEECH 2023
Number of pages: 5
Publisher: ISCA
Publication date: 2023
Pages: 4459-4463
DOIs
Publication status: Published - 2023
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023

Conference

Conference: 24th International Speech Communication Association, Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/2023 - 24/08/2023
Sponsor: Amazon Science, Apple, Dataocean AI, et al., Google Research, Meta AI
Series: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN: 2308-457X

Bibliographical note

Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.

Keywords

  • audio-visual
  • deep learning
  • inpainting
  • multimodal
  • speech
  • transformer
