Speech inpainting: Context-based speech synthesis guided by video

Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng Hua Tan, Jesper Jensen

Publication: Contribution to book/anthology/report/conference proceeding · Conference article in proceedings · Research · Peer-reviewed

2 Citations (Scopus)

Abstract

Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment in a way that is consistent with the corresponding visual content and the uncorrupted audio context. We present an audio-visual transformer-based deep learning model that leverages visual cues providing information about the content of the corrupted audio. It outperforms the previous state-of-the-art audio-visual model and audio-only baselines. We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
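To make the task concrete, the following is a minimal, hypothetical PyTorch sketch of audio-visual speech inpainting in the spirit of the abstract, not the authors' actual architecture: corrupted spectrogram frames are replaced by a learned mask embedding, fused with frame-aligned visual features (standing in for precomputed AV-HuBERT embeddings), and a transformer encoder regresses the missing frames from the surrounding audio context. All class names, dimensions, and fusion choices here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AVInpaintingSketch(nn.Module):
    """Hypothetical sketch of audio-visual speech inpainting (not the
    paper's exact model): masked spectrogram frames are replaced by a
    learned embedding, fused with frame-aligned visual features, and a
    transformer encoder predicts the missing frames from context."""

    def __init__(self, n_mels: int = 80, d_visual: int = 768, d_model: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.mask_embed = nn.Parameter(torch.zeros(d_model))  # learned "hole" token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_mels)

    def forward(self, mels: torch.Tensor, visual: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
        # mels:   (B, T, n_mels) log-mel spectrogram with a corrupted span
        # visual: (B, T, d_visual) visual features aligned to the audio frames
        # mask:   (B, T) bool tensor, True where the audio is missing
        a = self.audio_proj(mels)
        a = torch.where(mask.unsqueeze(-1), self.mask_embed.expand_as(a), a)
        x = a + self.visual_proj(visual)  # additive fusion is one simple choice
        return self.head(self.encoder(x))  # predicted spectrogram for all frames


# Toy usage: inpaint a 20-frame hole in a 100-frame utterance.
B, T = 2, 100
clean = torch.randn(B, T, 80)                 # ground-truth spectrogram
mask = torch.zeros(B, T, dtype=torch.bool)
mask[:, 40:60] = True                         # simulated corrupted segment
corrupted = clean.masked_fill(mask.unsqueeze(-1), 0.0)
visual = torch.randn(B, T, 768)               # stand-in for AV-HuBERT features
model = AVInpaintingSketch()
pred = model(corrupted, visual, mask)
loss = nn.functional.l1_loss(pred[mask], clean[mask])  # supervise only the hole
```

The abstract's point is that AV-HuBERT-derived visual features carry enough information about the spoken content to drive such synthesis; in this sketch they are faked with random tensors purely to keep the example self-contained and runnable.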

Original language: English
Title: Proc. INTERSPEECH 2023
Number of pages: 5
Publisher: ISCA
Publication date: 2023
Pages: 4459-4463
DOI
Status: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023

Conference

Conference: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/2023 - 24/08/2023
Sponsors: Amazon Science, Apple, Dataocean AI, et al., Google Research, Meta AI
Series: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN: 2308-457X

Bibliographical note

Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.
