Audio-Visual Fusion using Multiscale Temporal Convolutional Attention for Time-Domain Speech Separation

Debang Liu, Tianqi Zhang, Mads Græsbøll Christensen, Ying Wei, Zeliang An

Publication: Contribution to journal › Conference article in journal › Research › peer review

Abstract

Audio-only speech separation methods cannot fully exploit the audio-visual correlation information of speakers, which limits separation performance. Additionally, audio-visual separation methods usually fuse audio-visual features with the traditional approach of feature concatenation and linear mapping; this approach leaves the fusion process itself underexplored. Therefore, in this paper, drawing on the changes of the speaker's mouth landmarks, we propose a time-domain audio-visual temporal convolution attention speech separation method (AVTA). In AVTA, we design a multiscale temporal convolutional attention (MTCA) to better capture the contextual dependencies of time sequences. We then use a sequence learning and fusion network composed of MTCA to build a separation model for the speech separation task. On different datasets, AVTA achieves competitive performance, and compared to baseline methods, AVTA strikes a better balance among training cost, computational complexity, and separation performance.
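The paper does not spell out the MTCA internals in this abstract, so the following is only a hypothetical NumPy sketch of the general idea it names: run depthwise 1-D convolutions over the time axis at several kernel scales, then combine the scale branches with softmax attention weights. All function names, kernel sizes, and the scale-scoring rule are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Depthwise 1-D convolution with 'same' padding.
    x: (channels, time), kernel: (channels, k) with odd k."""
    C, T = x.shape
    k = kernel.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((C, T))
    for t in range(T):
        # Each output sample is a per-channel dot product over a k-wide window.
        out[:, t] = np.sum(xp[:, t:t + k] * kernel, axis=1)
    return out

def mtca_block(x, kernel_sizes=(3, 5, 7), rng=None):
    """Hypothetical multiscale temporal convolutional attention:
    convolve the input at several temporal scales, score each scale,
    and fuse the branches with softmax attention weights."""
    rng = rng or np.random.default_rng(0)
    C, _ = x.shape
    branches, scores = [], []
    for k in kernel_sizes:
        w = rng.standard_normal((C, k)) / np.sqrt(k)  # random demo weights
        y = np.maximum(depthwise_conv1d(x, w), 0.0)   # ReLU nonlinearity
        branches.append(y)
        scores.append(y.mean())  # assumed scoring: global average response
    s = np.array(scores)
    a = np.exp(s - s.max())
    a /= a.sum()  # softmax attention over scales
    return sum(ai * yi for ai, yi in zip(a, branches))

x = np.random.default_rng(1).standard_normal((4, 100))  # (channels, time)
y = mtca_block(x)
print(y.shape)  # → (4, 100)
```

In a real separation model these weights would be learned, and the block would sit inside the sequence learning and fusion network; the sketch only shows how multiscale branches and an attention-weighted fusion fit together.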

Original language: English
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
Pages (from-to): 3694-3698
Number of pages: 5
ISSN: 2308-457X
DOI
Status: Published - 2023
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023

Conference

Conference: 24th International Speech Communication Association, Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/2023 - 24/08/2023
Sponsors: Amazon Science, Apple, Dataocean AI, et al., Google Research, Meta AI

Bibliographical note

Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.
