TY - JOUR
T1 - ACTUAL: Audio Captioning With Caption Feature Space Regularization
T2 - IEEE/ACM Transactions on Audio, Speech, and Language Processing
AU - Zhang, Yiming
AU - Yu, Hong
AU - Du, Ruoyi
AU - Tan, Zheng-Hua
AU - Wang, Wenwu
AU - Ma, Zhanyu
AU - Dong, Yuan
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2023
Y1 - 2023
N2 - Audio captioning aims to describe the content of audio clips in human language. Due to the ambiguity of audio content, different people may perceive the same audio clip differently, resulting in caption disparities (i.e., the same audio clip may be described by several captions with diverse semantics). In the literature, a one-to-many strategy is often employed to train audio captioning models, where a related caption is randomly selected as the optimization target for each audio clip at each training iteration. However, we observe that this can lead to significant variations during the optimization process and adversely affect model performance. In this article, we address this issue by proposing an audio captioning method named ACTUAL (Audio Captioning with capTion featUre spAce reguLarization). ACTUAL involves a two-stage training process: (i) in the first stage, we use contrastive learning to construct a proxy feature space in which the similarities between captions at the audio level are explored, and (ii) in the second stage, the proxy feature space is used as additional supervision to steer the optimization of the model in a more stable direction. We conduct extensive experiments to demonstrate the effectiveness of the proposed ACTUAL method. The results show that the proxy caption embedding significantly improves the performance of the baseline model, and that the proposed ACTUAL method offers competitive performance on two datasets compared with state-of-the-art methods.
KW - Audio captioning
KW - caption consistency regularization
KW - contrastive learning
KW - cross-modal task
UR - http://www.scopus.com/inward/record.url?scp=85164416896&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2023.3293015
DO - 10.1109/TASLP.2023.3293015
M3 - Journal article
AN - SCOPUS:85164416896
SN - 2329-9290
VL - 31
SP - 2643
EP - 2657
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
ER -