The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use

Bob L. Sturm


Department of Architecture and Media Technology

Publication: Contribution to journal › Journal article › Research › Peer-reviewed

Abstract

The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge the interpretability of any result derived using it. In this article, we disprove the claims that all MGR systems are affected in the same ways by these faults, and that the performances of MGR systems in GTZAN are still meaningfully comparable since they all face the same faults. We identify and analyze the contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN, but to use it with consideration of its contents.

Original language	English
Journal	arXiv.org (e-prints)
Pages (from-to)	1-29
Number of pages	29
Status	Published - 2013

Greedy Sparse Approximation and the Automatic Description of Audio and Music Data
Sturm, B. L.
Technology and Production Independent Postdoc Center for Independent Research
01/01/2012 → …
Projects: Project › Research

Citation formats

@article{95c5358ae73347af938e7fbe87e7da74,
  title = "The {GTZAN} dataset: Its contents, its faults, their effects on evaluation, and its future use",
  abstract = "The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge the interpretability of any result derived using it. In this article, we disprove the claims that all MGR systems are affected in the same ways by these faults, and that the performances of MGR systems in GTZAN are still meaningfully comparable since they all face the same faults. We identify and analyze the contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN, but to use it with consideration of its contents.",
  author = "Sturm, {Bob L.}",
  year = "2013",
  language = "English",
  pages = "1--29",
  journal = "arXiv.org (e-prints)",
  publisher = "Cornell University Library",
}

TY  - JOUR
T1  - The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use
AU  - Sturm, Bob L.
PY  - 2013
Y1  - 2013
N2  - The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge the interpretability of any result derived using it. In this article, we disprove the claims that all MGR systems are affected in the same ways by these faults, and that the performances of MGR systems in GTZAN are still meaningfully comparable since they all face the same faults. We identify and analyze the contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN, but to use it with consideration of its contents.
AB  - The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge the interpretability of any result derived using it. In this article, we disprove the claims that all MGR systems are affected in the same ways by these faults, and that the performances of MGR systems in GTZAN are still meaningfully comparable since they all face the same faults. We identify and analyze the contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN, but to use it with consideration of its contents.
M3  - Journal article
SP  - 1
EP  - 29
JO  - arXiv.org (e-prints)
JF  - arXiv.org (e-prints)
ER  - 