Audio-Visual Fusion using Multiscale Temporal Convolutional Attention for Time-Domain Speech Separation

Debang Liu, Tianqi Zhang, Mads Græsbøll Christensen, Ying Wei, Zeliang An

Research output: Contribution to journal › Conference article in Journal › Research › peer-review


Audio-only speech separation methods cannot fully exploit the audio-visual correlation information of the speaker, which limits separation performance. Additionally, audio-visual separation methods usually fuse audio-visual features through simple feature concatenation and linear mapping, an approach that leaves the fusion process underexplored. Therefore, in this paper, drawing on the movements of the speaker's mouth landmarks, we propose a time-domain audio-visual temporal convolutional attention speech separation method (AVTA). In AVTA, we design a multiscale temporal convolutional attention (MTCA) module to better capture the contextual dependencies of time sequences. We then build a separation model for the speech separation task from sequence-learning and fusion networks composed of MTCA blocks. On different datasets, AVTA achieves competitive performance and, compared to baseline methods, strikes a better balance among training cost, computational complexity, and separation performance.
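The paper itself does not publish reference code here, but the core idea named in the abstract, attending over a time sequence using temporal convolutions at several kernel scales, can be illustrated with a minimal NumPy sketch. Everything below (function names, the averaging kernels, the sigmoid gating) is a hypothetical stand-in for the authors' MTCA module, not its actual implementation:

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Per-channel 1-D convolution with 'same' padding.
    x: (T, C) time sequence, kernel: (k,) shared across channels."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + k] * kernel[:, None]).sum(axis=0)
    return out

def mtca_sketch(x, kernel_sizes=(3, 5, 7)):
    """Hypothetical multiscale temporal convolutional attention:
    pool convolution responses at several temporal scales, squash them
    into per-frame attention weights, and reweight the input sequence."""
    scales = []
    for k in kernel_sizes:
        kern = np.ones(k) / k          # simple averaging kernel as a stand-in
        scales.append(depthwise_conv1d(x, kern))
    ctx = np.mean(scales, axis=0)      # (T, C) multiscale temporal context
    # sigmoid gate per frame, broadcast over channels
    att = 1.0 / (1.0 + np.exp(-ctx.mean(axis=1, keepdims=True)))  # (T, 1)
    return x * att                     # attention-reweighted sequence
```

In an actual separation network the convolutions would be learned (and the gate would typically act on fused audio-visual features rather than a single stream); the sketch only shows the multiscale-context-to-attention data flow.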

Original language: English
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Pages (from-to): 3694-3698
Number of pages: 5
Publication status: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023


Conference: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023
Sponsor: Amazon Science, Apple, Dataocean AI, et al., Google Research, Meta AI

Bibliographical note

Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.


  • audio-visual fusion
  • speech separation
  • temporal convolutional attention
  • time-domain
  • training cost

