Audio xLSTMs: Learning Self-Supervised Audio Representations with xLSTMs

Research output: Working paper/Preprint


Abstract

While the transformer has emerged as the preeminent neural architecture, several independent lines of research have sought to address its limitations. Recurrent approaches have also seen renewed interest, notably the extended long short-term memory (xLSTM) architecture, which revitalizes the original LSTM design. However, while xLSTMs have shown competitive performance compared to the transformer, their viability for learning self-supervised general-purpose audio representations has not yet been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach to learning audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 20% in relative performance across a set of ten diverse downstream tasks, while having up to 45% fewer parameters.
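
The masked-spectrogram-patch objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the encoder below is a plain stacked `nn.LSTM` standing in for xLSTM blocks (which are not part of standard PyTorch), and the patch size, embedding dimension, and masking ratio are assumed values for illustration only.

```python
# Minimal sketch of masked-spectrogram-patch pretraining (illustrative only).
# The encoder is a stacked nn.LSTM used as a stand-in for xLSTM blocks;
# patch size, dimensions, and masking ratio are assumptions, not paper values.
import torch
import torch.nn as nn


class MaskedPatchPretrainer(nn.Module):
    def __init__(self, patch_dim=256, embed_dim=512, num_layers=4, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)        # project flattened patches
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))    # learned token for masked slots
        self.encoder = nn.LSTM(embed_dim, embed_dim, num_layers,  # stand-in for xLSTM blocks
                               batch_first=True)
        self.reconstruct = nn.Linear(embed_dim, patch_dim)        # predict original patch content

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) flattened spectrogram patches
        b, n, _ = patches.shape
        x = self.patch_embed(patches)

        # Randomly mask a fraction of patch embeddings with the learned mask token.
        mask = torch.rand(b, n, device=patches.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand(b, n, -1), x)

        hidden, _ = self.encoder(x)
        pred = self.reconstruct(hidden)

        # Reconstruction loss is computed only on the masked positions.
        loss = ((pred - patches) ** 2).mean(dim=-1)[mask].mean()
        return loss


if __name__ == "__main__":
    model = MaskedPatchPretrainer()
    dummy = torch.randn(2, 128, 256)  # 2 clips, 128 patches of 16x16 bins each
    print(model(dummy).item())
```

In this sketch the loss is taken only over masked positions, so the encoder must infer the missing spectrogram content from the surrounding patches; a fine-tuning or linear-probing stage on downstream tasks would then reuse the pretrained encoder and discard the reconstruction head.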
Original language: English
Publisher: arXiv
Number of pages: 5
DOIs
Publication status: Published - 29 Aug 2024

Bibliographical note

Under review at ICASSP 2025. arXiv admin note: text overlap with arXiv:2406.02178

Keywords

  • cs.SD
  • eess.AS
