Vocoder-Based Speech Synthesis from Silent Videos

Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

13 Citations (Scopus)
150 Downloads (Pure)

Abstract

Both acoustic and visual information influence human perception of speech. For this reason, the absence of audio in a video sequence results in extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion, and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over existing video-to-speech approaches.
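
To make the described approach concrete, the sketch below shows one plausible shape of such a multi-task model: a shared video encoder whose output feeds both a regression head for vocoder acoustic features and a classification head for text. This is only an illustrative assumption based on the abstract; the framework (PyTorch), layer types, feature dimensions, and the use of a CTC-style text branch are hypothetical choices, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): a shared video encoder
# feeding (i) a regression head for vocoder acoustic features and (ii) a
# classification head for text, trained jointly in a multi-task fashion.
# Layer sizes, feature dimensions and the PyTorch framework are illustrative.
import torch
import torch.nn as nn

class VideoToSpeechMultiTask(nn.Module):
    def __init__(self, n_acoustic=64, n_chars=28, hidden=256):
        super().__init__()
        # Spatio-temporal front-end over raw grayscale mouth-region frames:
        # (batch, 1, time, height, width) -> (batch, 64, time, 1, 1)
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, (3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, (3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool out spatial dims, keep time
        )
        # Temporal model shared by both tasks
        self.rnn = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        # Task 1: per-frame regression of vocoder acoustic features
        self.acoustic_head = nn.Linear(2 * hidden, n_acoustic)
        # Task 2: per-frame character logits (e.g. decoded with CTC)
        self.text_head = nn.Linear(2 * hidden, n_chars + 1)  # +1 for CTC blank

    def forward(self, frames):
        x = self.frontend(frames)                      # (B, 64, T, 1, 1)
        x = x.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        x, _ = self.rnn(x)                             # (B, T, 2*hidden)
        return self.acoustic_head(x), self.text_head(x)

if __name__ == "__main__":
    model = VideoToSpeechMultiTask()
    clips = torch.randn(2, 1, 75, 64, 64)  # two 75-frame clips, 64x64 pixels
    acoustic, text_logits = model(clips)
    # Training would combine an acoustic regression loss (e.g. MSE) with a
    # text loss (e.g. CTC), weighted by some hypothetical factor; the predicted
    # acoustic features would then drive a vocoder to synthesise the waveform.
    print(acoustic.shape, text_logits.shape)  # [2, 75, 64] and [2, 75, 29]
```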

Original language: English
Title of host publication: Interspeech 2020
Number of pages: 5
Publication date: 2020
Pages: 3530-3534
DOIs
Publication status: Published - 2020
Event: Interspeech 2020 - Shanghai, China
Duration: 25 Oct 2020 - 29 Oct 2020

Conference

Conference: Interspeech 2020
Country/Territory: China
City: Shanghai
Period: 25/10/2020 - 29/10/2020
Series: Proceedings of the International Conference on Spoken Language Processing
ISSN: 1990-9772

Keywords

  • Deep learning
  • Lip reading
  • Speech synthesis
  • Vocoder
