Abstract
Audio-only speech separation methods cannot fully exploit the audio-visual correlation information of a speaker, which limits separation performance. Additionally, audio-visual separation methods usually adopt the traditional approach of feature concatenation and linear mapping to fuse audio-visual features, leaving the fusion process itself under-explored. Therefore, in this paper, combining the changes of the speaker's mouth landmarks, we propose a time-domain audio-visual temporal convolution attention speech separation method (AVTA). In AVTA, we design a multiscale temporal convolutional attention (MTCA) module to better focus on the contextual dependencies of time sequences. We then use a sequence learning and fusion network composed of MTCA to build a separation model for the speech separation task. AVTA achieves competitive performance on different datasets and, compared to baseline methods, strikes a better balance between training cost, computational complexity, and separation performance.
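The abstract names a multiscale temporal convolutional attention (MTCA) block but gives no implementation details. As a rough illustration only, the following NumPy sketch shows one plausible reading of such a block: parallel dilated temporal convolutions at several scales, fused by softmax attention weights. All function names, the per-scale scoring scheme, and the shapes are hypothetical assumptions, not the authors' actual AVTA/MTCA design.

```python
import numpy as np

def depthwise_conv1d(x, kernel, dilation=1):
    """'Same'-padded depthwise 1-D convolution along the time axis.
    x: (T, C) feature sequence, kernel: (K, C) per-channel weights."""
    T, C = x.shape
    K = kernel.shape[0]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        for i in range(K):
            out[t] += xp[t + i * dilation] * kernel[i]
    return out

def mtca_block(x, kernels, dilations):
    """Hypothetical multiscale temporal convolution attention:
    run parallel convolutions at several dilations, then combine
    the scales with softmax attention weights."""
    scales = np.stack([depthwise_conv1d(x, k, d)
                       for k, d in zip(kernels, dilations)])  # (S, T, C)
    scores = scales.mean(axis=(1, 2))       # crude per-scale score (assumption)
    w = np.exp(scores - scores.max())
    w /= w.sum()                            # softmax over scales
    return np.tensordot(w, scales, axes=1)  # (T, C) weighted fusion

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 8))            # 50 frames, 8 channels
kernels = [rng.standard_normal((3, 8)) * 0.1 for _ in range(3)]
out = mtca_block(x, kernels, dilations=[1, 2, 4])
print(out.shape)  # (50, 8)
```

Increasing dilations widen the temporal receptive field without extra parameters, which is one common way to capture longer contextual dependencies in time-domain separation models.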
Original language | English
---|---
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume | 2023-August
Pages (from-to) | 3694-3698
Number of pages | 5
ISSN | 2308-457X
DOI |
Status | Published - 2023
Event | 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland. Duration: 20 Aug 2023 → 24 Aug 2023
Conference
Conference | 24th International Speech Communication Association, Interspeech 2023
---|---
Country/Territory | Ireland
City | Dublin
Period | 20/08/2023 → 24/08/2023
Sponsor | Amazon Science, Apple, Dataocean AI, et al., Google Research, Meta AI
Bibliographic note
Publisher Copyright: © 2023 International Speech Communication Association. All rights reserved.