Zero-Shot Audio Captioning Using Soft and Hard Prompts

Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma*

*Corresponding author

Publication: Contribution to journal, Journal article, Research, Peer-reviewed

Abstract

In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test set from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, and this issue has received little attention. To address these issues, we propose a new zero-shot method for audio captioning. Our method is built on the contrastive language-audio pre-training (CLAP) model. During training, the model reconstructs the ground-truth caption using the CLAP text encoder. In the inference stage, the model generates text descriptions from the CLAP audio embeddings of given audio inputs. To enhance the ability of the model in transitioning from text-to-text generation to audio-to-text generation, we propose to use the mixed-augmentations-based soft prompt to learn more robust latent representations, leveraging instance replacement and embedding augmentation. Additionally, we introduce the retrieval-based acoustic-aware hard prompt to improve the cross-domain performance of the model by employing the domain-agnostic label information of sound events. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method. The code is publicly available at https://github.com/XinMing0411/zero-shot-AAC.
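The two prompt ideas named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper or its repository: toy 3-dimensional vectors stand in for CLAP embeddings, Gaussian noise followed by renormalization stands in for embedding augmentation, cosine-similarity top-k lookup over a small label bank stands in for retrieval, and the function names, parameters, and prompt template are hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def augment_embedding(text_emb, noise_std=0.1, rng=random):
    """Assumed form of embedding augmentation: add Gaussian noise, renormalize."""
    noisy = [x + rng.gauss(0.0, noise_std) for x in text_emb]
    return l2_normalize(noisy)


def retrieve_hard_prompt(audio_emb, label_bank, top_k=2):
    """Build a text prompt from the top_k labels closest to the audio embedding.

    label_bank maps a sound-event label to its (toy) text embedding.
    The prompt template is hypothetical.
    """
    ranked = sorted(
        label_bank,
        key=lambda name: cosine(audio_emb, label_bank[name]),
        reverse=True,
    )
    return "Sound events: " + ", ".join(ranked[:top_k]) + "."


# Toy stand-ins for CLAP embeddings (real ones are high-dimensional).
label_bank = {
    "dog barking": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
    "car engine": [0.0, 0.0, 1.0],
}
audio_emb = [0.9, 0.4, 0.1]

# Inference-time idea: retrieve sound-event labels for the audio input.
prompt = retrieve_hard_prompt(audio_emb, label_bank, top_k=2)

# Training-time idea: perturb a text embedding so the decoder tolerates
# the gap between text and audio embeddings.
noisy_text_emb = augment_embedding(
    [1.0, 0.0, 0.0], noise_std=0.1, rng=random.Random(0)
)
```

With these toy vectors, `prompt` evaluates to `"Sound events: dog barking, rain."`, since those two labels have the highest cosine similarity to `audio_emb`. The perturbed embedding keeps unit length, so it stays on the same hypersphere as the unperturbed embeddings.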
Original language: English
Article number: 10990029
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 33
Pages (from-to): 2045-2058
Number of pages: 14
ISSN: 2998-4173
DOI
Status: Published - 2025
