An Xception Residual Recurrent Neural Network for Audio Event Detection and Tagging

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceedings › Research › peer-review

Abstract

Audio tagging (AT) refers to automatically identifying whether a particular sound event is contained in a given audio segment. Sound event detection (SED) additionally requires a system to determine when exactly a sound event occurs within the segment. Task 4 of the DCASE 2017 competition required participants to solve both tasks automatically for a set of 17 sounds (horn, siren, car, bicycle, etc.) relevant for smart cars, a subset of the weakly labeled AudioSet dataset. We propose the Xception Stacked Residual Recurrent Neural Network (XRRNN), based on modifications of the CVSSP system by Xu et al. (2017), which won the challenge for the AT task. The processing stages of the XRRNN consist of 1) an Xception module as front-end, 2) a 1 × 1 convolution, 3) a set of stacked residual recurrent neural networks, and 4) a feed-forward layer with attention. Using log-Mel spectra and MFCCs as input features, as well as a fusion of the posteriors of networks trained on these input features, we obtain the following results through a set of Bonferroni-corrected t-tests with 30 models per configuration: For AT, XRRNN significantly outperforms the CVSSP system with a 1.3% improvement in F-score (p = 0.0323; XRRNN-logMel vs. CVSSP-fusion). For SED, XRRNN significantly reduces the error rate for all three input-feature configurations, by 4.5% on average (average p = 1.06 · 10⁻¹⁰).
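The abstract names four processing stages but gives no hyperparameters. The following is a minimal PyTorch sketch of that four-stage pipeline; all layer widths, the number of residual recurrent blocks, the use of GRUs, and the input size are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the four XRRNN stages described in the abstract.
# Layer sizes, block counts, and the choice of GRUs are assumptions.
import torch
import torch.nn as nn


class SeparableConv2d(nn.Module):
    """Depthwise-separable convolution, the building block of Xception."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class ResidualGRU(nn.Module):
    """One stacked residual recurrent block: bidirectional GRU plus skip."""
    def __init__(self, dim):
        super().__init__()
        self.gru = nn.GRU(dim, dim // 2, batch_first=True,
                          bidirectional=True)

    def forward(self, x):
        out, _ = self.gru(x)
        return x + out  # residual connection


class XRRNNSketch(nn.Module):
    def __init__(self, n_mels=64, n_classes=17, hidden=128, n_blocks=3):
        super().__init__()
        # 1) Xception-style front-end (one separable-conv stage here).
        self.frontend = nn.Sequential(
            SeparableConv2d(1, 32), nn.ReLU(),
            nn.MaxPool2d((1, 2)),  # pool along frequency only
        )
        # 2) 1 x 1 convolution mixing channels.
        self.conv1x1 = nn.Conv2d(32, 32, 1)
        self.proj = nn.Linear(32 * (n_mels // 2), hidden)
        # 3) Stacked residual recurrent blocks.
        self.rnn = nn.Sequential(*[ResidualGRU(hidden)
                                   for _ in range(n_blocks)])
        # 4) Feed-forward layer with attention: frame-level posteriors
        #    pooled over time by a learned attention, as in CVSSP.
        self.classifier = nn.Linear(hidden, n_classes)
        self.attention = nn.Linear(hidden, n_classes)

    def forward(self, x):  # x: (batch, time, n_mels) log-Mel input
        h = self.frontend(x.unsqueeze(1))            # (B, C, T, F')
        h = self.conv1x1(h)
        b, c, t, f = h.shape
        h = self.proj(h.permute(0, 2, 1, 3).reshape(b, t, c * f))
        h = self.rnn(h)
        frame_probs = torch.sigmoid(self.classifier(h))  # SED posteriors
        att = torch.softmax(self.attention(h), dim=1)    # attention over time
        clip_probs = (frame_probs * att).sum(dim=1)      # AT posteriors
        return clip_probs, frame_probs
```

The attention-weighted pooling at the end produces clip-level posteriors for AT while the per-frame sigmoid outputs serve SED, which is how the abstract's single architecture can address both tasks.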
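The significance claims rest on Bonferroni-corrected t-tests over 30 trained models per configuration. Below is a minimal sketch of that protocol using SciPy; the F-score arrays are randomly generated placeholders, not the paper's data, and the choice of an unpaired test and of three comparisons (one per input-feature configuration) are assumptions.

```python
# Sketch of the evaluation protocol: t-tests over 30 models per system,
# Bonferroni-corrected for multiple comparisons. Data are placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical F-scores of 30 independently trained models per system.
xrrnn_f1 = rng.normal(0.58, 0.02, size=30)   # e.g. XRRNN-logMel
cvssp_f1 = rng.normal(0.567, 0.02, size=30)  # e.g. CVSSP-fusion

n_comparisons = 3  # assumed: one test per input-feature configuration
t_stat, p_value = stats.ttest_ind(xrrnn_f1, cvssp_f1)
p_corrected = min(p_value * n_comparisons, 1.0)  # Bonferroni correction

print(f"t = {t_stat:.3f}, corrected p = {p_corrected:.4f}")
print("significant at alpha = 0.05" if p_corrected < 0.05
      else "not significant")
```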

Details

Original language: English
Title of host publication: Sound and Music Computing Conference 2018
Publisher: Zenodo
Publication date: 2018
DOI
Publication status: Published - 2018
Publication category: Research
Peer-reviewed: Yes
Event: 15th International Sound & Music Computing Conference - Limassol, Cyprus
Duration: 4 Jul 2018 → …

Conference

Conference: 15th International Sound & Music Computing Conference
Country: Cyprus
City: Limassol
Period: 04/07/2018 → …
