TY - JOUR
T1 - Generating Accurate and Diverse Audio Captions Through Variational Autoencoder Framework
AU - Zhang, Yiming
AU - Du, Ruoyi
AU - Tan, Zheng-Hua
AU - Wang, Wenwu
AU - Ma, Zhanyu
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Generating both diverse and accurate descriptions is an essential goal in the audio captioning task. Traditional methods mainly focus on improving the accuracy of the generated captions but ignore their diversity. In contrast, recent methods have considered generating diverse captions for a given audio clip, but with the potential trade-off in caption accuracy. In this work, we propose a new diverse audio captioning method based on a variational autoencoder structure, dubbed AC-VAE, aiming to achieve a better trade-off between the diversity and accuracy of the generated captions. To improve diversity, AC-VAE learns the latent word distribution at each location based on contextual information. To uphold accuracy, AC-VAE incorporates an autoregressive prior module and a global constraint module, which enable precise modeling of word distribution and encourage semantic consistency of captions at the sentence level. We evaluate the proposed AC-VAE on the Clotho dataset. Experimental results show that AC-VAE achieves a better trade-off between diversity and accuracy compared to the state-of-the-art methods.
AB - Generating both diverse and accurate descriptions is an essential goal in the audio captioning task. Traditional methods mainly focus on improving the accuracy of the generated captions but ignore their diversity. In contrast, recent methods have considered generating diverse captions for a given audio clip, but with the potential trade-off in caption accuracy. In this work, we propose a new diverse audio captioning method based on a variational autoencoder structure, dubbed AC-VAE, aiming to achieve a better trade-off between the diversity and accuracy of the generated captions. To improve diversity, AC-VAE learns the latent word distribution at each location based on contextual information. To uphold accuracy, AC-VAE incorporates an autoregressive prior module and a global constraint module, which enable precise modeling of word distribution and encourage semantic consistency of captions at the sentence level. We evaluate the proposed AC-VAE on the Clotho dataset. Experimental results show that AC-VAE achieves a better trade-off between diversity and accuracy compared to the state-of-the-art methods.
KW - Diverse audio captioning
KW - diverse caption generation
KW - variational autoencoder
UR - http://www.scopus.com/inward/record.url?scp=85195393183&partnerID=8YFLogxK
U2 - 10.1109/LSP.2024.3409212
DO - 10.1109/LSP.2024.3409212
M3 - Journal article
AN - SCOPUS:85195393183
SN - 1070-9908
VL - 31
SP - 2520
EP - 2524
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
ER -