Audio-Visual Speech Inpainting with Deep Learning

Giovanni Morrone; Daniel Michelsanti; Zheng-Hua Tan; Jesper Jensen

doi:10.1109/ICASSP39728.2021.9413488

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

20 Citations (Scopus)

Abstract

In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.

Original language	English
Title of host publication	ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Number of pages	5
Volume	2021-June
Publisher	IEEE
Publication date	2021
Pages	6653-6657
ISBN (Print)	978-1-7281-7606-2
ISBN (Electronic)	978-1-7281-7605-5
DOIs	https://doi.org/10.1109/ICASSP39728.2021.9413488
Publication status	Published - 2021
Event	ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - Toronto, Canada Duration: 6 Jun 2021 → 11 Jun 2021

Conference

Conference	ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Country/Territory	Canada
City	Toronto
Period	06/06/2021 → 11/06/2021

Series	I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings
ISSN	1520-6149

Keywords

Audio-visual
Deep learning
Face-landmarks
Multi-task learning
Speech inpainting

Access to Document

10.1109/ICASSP39728.2021.9413488

https://arxiv.org/pdf/2010.04556.pdf

Cite this

@inproceedings{8b8907e850594edd8d3649b662543c0a,
  title = "Audio-Visual Speech Inpainting with Deep Learning",
  abstract = "In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.",
  keywords = "Audio-visual, Deep learning, Face-landmarks, Multi-task learning, Speech inpainting",
  author = "Giovanni Morrone and Daniel Michelsanti and Zheng-Hua Tan and Jesper Jensen",
  year = "2021",
  doi = "10.1109/ICASSP39728.2021.9413488",
  language = "English",
  isbn = "978-1-7281-7606-2",
  volume = "2021-June",
  series = "I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings",
  publisher = "IEEE",
  pages = "6653--6657",
  booktitle = "ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
  address = "United States",
  note = "ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ; Conference date: 06-06-2021 Through 11-06-2021",
}

Morrone, G, Michelsanti, D, Tan, Z-H & Jensen, J 2021, Audio-Visual Speech Inpainting with Deep Learning. in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). vol. 2021-June, IEEE, I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings, pp. 6653-6657, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Ontario, Canada, 06/06/2021. https://doi.org/10.1109/ICASSP39728.2021.9413488

Audio-Visual Speech Inpainting with Deep Learning. / Morrone, Giovanni; Michelsanti, Daniel; Tan, Zheng-Hua et al.
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vol. 2021-June IEEE, 2021. p. 6653-6657 (I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings).

TY - GEN

T1 - Audio-Visual Speech Inpainting with Deep Learning

AU - Morrone, Giovanni

AU - Michelsanti, Daniel

AU - Tan, Zheng-Hua

AU - Jensen, Jesper

PY - 2021

Y1 - 2021

N2 - In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.

AB - In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.

KW - Audio-visual

KW - Deep learning

KW - Face-landmarks

KW - Multi-task learning

KW - Speech inpainting

UR - http://www.scopus.com/inward/record.url?scp=85109062184&partnerID=8YFLogxK

U2 - 10.1109/ICASSP39728.2021.9413488

DO - 10.1109/ICASSP39728.2021.9413488

M3 - Article in proceeding

SN - 978-1-7281-7606-2

VL - 2021-June

T3 - I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings

SP - 6653

EP - 6657

BT - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

PB - IEEE

T2 - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Y2 - 6 June 2021 through 11 June 2021

ER -

Morrone G, Michelsanti D, Tan Z-H, Jensen J. Audio-Visual Speech Inpainting with Deep Learning. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vol. 2021-June. IEEE. 2021. p. 6653-6657. (I E E E International Conference on Acoustics, Speech and Signal Processing. Proceedings). doi: 10.1109/ICASSP39728.2021.9413488
