Abstract
Audio-only speech separation methods cannot fully exploit the audio-visual correlation information of a speaker, which limits separation performance. Additionally, audio-visual separation methods usually adopt the traditional approach of feature concatenation and linear mapping to fuse audio-visual features, leaving the fusion process itself under-explored. Therefore, in this paper, combining the changes of the speaker's mouth landmarks, we propose a time-domain audio-visual temporal convolution attention speech separation method (AVTA). In AVTA, we design a multiscale temporal convolutional attention (MTCA) module to better focus on the contextual dependencies of time sequences. We then use a sequence learning and fusion network composed of MTCA to build a separation model for the speech separation task. AVTA achieves competitive performance on different datasets and, compared to baseline methods, strikes a better balance between training cost, computational complexity, and separation performance.
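The abstract names a multiscale temporal convolutional attention (MTCA) block but gives no implementation details. As a rough illustration only, the following NumPy sketch shows one plausible reading of such a block: parallel dilated temporal convolutions at several scales, fused by softmax attention weights. All function names, the per-scale scoring scheme, and the shapes are hypothetical assumptions, not the authors' actual AVTA/MTCA design.

```python
import numpy as np

def depthwise_conv1d(x, kernel, dilation=1):
    """'Same'-padded depthwise 1-D convolution along the time axis.
    x: (T, C) feature sequence, kernel: (K, C) per-channel weights."""
    T, C = x.shape
    K = kernel.shape[0]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        for i in range(K):
            out[t] += xp[t + i * dilation] * kernel[i]
    return out

def mtca_block(x, kernels, dilations):
    """Hypothetical multiscale temporal convolution attention:
    run parallel convolutions at several dilations, then combine
    the scales with softmax attention weights."""
    scales = np.stack([depthwise_conv1d(x, k, d)
                       for k, d in zip(kernels, dilations)])  # (S, T, C)
    scores = scales.mean(axis=(1, 2))       # crude per-scale score (assumption)
    w = np.exp(scores - scores.max())
    w /= w.sum()                            # softmax over scales
    return np.tensordot(w, scales, axes=1)  # (T, C) weighted fusion

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 8))            # 50 frames, 8 channels
kernels = [rng.standard_normal((3, 8)) * 0.1 for _ in range(3)]
out = mtca_block(x, kernels, dilations=[1, 2, 4])
print(out.shape)  # (50, 8)
```

Increasing dilations widen the temporal receptive field without extra parameters, which is one common way to capture longer contextual dependencies in time-domain separation models.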
Original language | English
---|---
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume | 2023-August
Pages (from-to) | 3694-3698
Number of pages | 5
ISSN | 2308-457X
DOI |
Status | Published - 2023
Event | 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland. Duration: 20 Aug 2023 → 24 Aug 2023
Conference
Conference | 24th International Speech Communication Association, Interspeech 2023
---|---
Country/Territory | Ireland
City | Dublin
Period | 20/08/2023 → 24/08/2023
Sponsor | Amazon Science, Apple, Dataocean AI, et al., Google Research, Meta AI
Bibliographic note
Publisher Copyright: © 2023 International Speech Communication Association. All rights reserved.