The Crisis of Evaluation in Music Information Retrieval

Bob L. Sturm (Speaker)

Institut for Arkitektur og Medieteknologi

Activity: Talks and oral contributions › Talks and presentations in private or public companies

Description

I critically address the "crisis of evaluation" in music information retrieval (MIR), with particular emphasis on music genre recognition, music mood recognition, and autotagging. I demonstrate four things: 1) many published results rest on datasets with unrecognized faults that render those results meaningless; 2) state-of-the-art ("high classification accuracy") systems are fooled by irrelevant factors; 3) most published results are based upon an invalid evaluation design; and 4) much work has unknowingly built, tuned, tested, compared, and advertised "horses" instead of solutions. (The example of the horse Clever Hans provides an appropriate illustration.) I argue these problems occur because: 1) many researchers assume a dataset is a good dataset because many others use it; 2) many researchers assume that evaluation practices standard in machine learning or information retrieval are useful and relevant for MIR; 3) many researchers mistake systematic, rigorous, and standardized evaluation for scientific evaluation; and 4) problems and success criteria remain ill-defined, and thus evaluation remains poor, because researchers do not define appropriate use cases. I show how this "crisis of evaluation" can be addressed by formalizing evaluation in MIR to make clear its aims, parts, design, execution, interpretation, and assumptions. I also present several alternative evaluation approaches that can separate horses from solutions.

Period	13 Nov. 2013
Held at	Unknown External Organisation

Documents and Links

SturmHorses2
File: application/octet-stream, 41.3 MB
Type: Text file

Related content

Publications
Evaluating music emotion recognition: Lessons from music genre recognition?
Publication: Contribution to journal › Conference article in journal › Research › peer review
Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing?
Publication: Contribution to book/anthology/report/conference proceeding › Conference article in proceeding › Research › peer review
Classification Accuracy Is Not Enough: On the Evaluation of Music Genre Recognition Systems
Publication: Contribution to journal › Journal article › Research › peer review
Formalizing Evaluation in Music Information Retrieval: A Look at the MIREX Automatic Mood Classification Task
Publication: Contribution to book/anthology/report/conference proceeding › Conference article in proceeding › Research › peer review
An Analysis of the GTZAN Music Genre Dataset
Publication: Contribution to book/anthology/report/conference proceeding › Conference article in proceeding › Research › peer review
A Survey of Evaluation in Music Genre Recognition
Publication: Contribution to journal › Conference article in journal › Research › peer review

Projects
Greedy Sparse Approximation and the Automatic Description of Audio and Music Data
Projects: Project › Research
