TY - GEN
T1 - An Xception Residual Recurrent Neural Network for Audio Event Detection and Tagging
AU - Gajarsky, Tomas
AU - Purwins, Hendrik
PY - 2018
Y1 - 2018
N2 - Audio tagging (AT) refers to automatically identifying whether a particular sound event is contained in a given audio segment. Sound event detection (SED) additionally requires a system to determine when exactly an audio event occurs within the segment. Task 4 of the DCASE 2017 competition required solving both tasks automatically for a set of 17 sounds (horn, siren, car, bicycle, etc.) relevant for smart cars, a subset of the weakly labeled AudioSet dataset. We propose the Xception-Stacked Residual Recurrent Neural Network (XRRNN), based on modifications of the CVSSP system by Xu et al. (2017), which won the challenge for the AT task. The processing stages of the XRRNN consist of 1) an Xception module as front-end, 2) a 1 x 1 convolution, 3) a set of stacked residual recurrent neural networks, and 4) a feed-forward layer with attention. Using log-Mel spectra and MFCCs as input features, as well as a fusion of the posteriors of networks trained on those input features, we obtain the following results from a set of Bonferroni-corrected t-tests with 30 models per configuration: For AT, XRRNN significantly outperforms the CVSSP system with a 1.3% improvement (p = 0.0323) in F-score (XRRNN-logMel vs. CVSSP-fusion). For SED, for all three input feature combinations, XRRNN significantly reduces the error rate by 4.5% on average (average p = 1.06 · 10^-10).
AB - Audio tagging (AT) refers to automatically identifying whether a particular sound event is contained in a given audio segment. Sound event detection (SED) additionally requires a system to determine when exactly an audio event occurs within the segment. Task 4 of the DCASE 2017 competition required solving both tasks automatically for a set of 17 sounds (horn, siren, car, bicycle, etc.) relevant for smart cars, a subset of the weakly labeled AudioSet dataset. We propose the Xception-Stacked Residual Recurrent Neural Network (XRRNN), based on modifications of the CVSSP system by Xu et al. (2017), which won the challenge for the AT task. The processing stages of the XRRNN consist of 1) an Xception module as front-end, 2) a 1 x 1 convolution, 3) a set of stacked residual recurrent neural networks, and 4) a feed-forward layer with attention. Using log-Mel spectra and MFCCs as input features, as well as a fusion of the posteriors of networks trained on those input features, we obtain the following results from a set of Bonferroni-corrected t-tests with 30 models per configuration: For AT, XRRNN significantly outperforms the CVSSP system with a 1.3% improvement (p = 0.0323) in F-score (XRRNN-logMel vs. CVSSP-fusion). For SED, for all three input feature combinations, XRRNN significantly reduces the error rate by 4.5% on average (average p = 1.06 · 10^-10).
U2 - 10.5281/zenodo.1422563
DO - 10.5281/zenodo.1422563
M3 - Article in proceedings
T3 - Proceedings of the Sound and Music Computing Conference
SP - 210
EP - 216
BT - Proceedings of the 15th Sound and Music Computing Conference (SMC2018)
PB - Sound and Music Computing Network
T2 - 15th International Sound & Music Computing Conference
Y2 - 4 July 2018
ER -