Supervised machine learning for audio emotion recognition: Enhancing film sound design using audio features, regression models and artificial neural networks

Stuart Cunningham; Harrison Ridley; Jonathan Rex Weinel; Richard Picking

doi:10.1007/s00779-020-01389-0

Publication: Contribution to journal › Journal article › Research › peer-review

26 Citations (Scopus)

Abstract

The field of Music Emotion Recognition has become an established research sub-domain of Music Information Retrieval. Less attention has been directed towards the counterpart domain of Audio Emotion Recognition, which focuses upon detection of emotional stimuli resulting from non-musical sound. By better understanding how sounds provoke emotional responses in an audience, it may be possible to enhance the work of sound designers. The work in this paper uses the International Affective Digital Sounds set. A total of 76 features are extracted from the sounds, spanning the time and frequency domains. The features are then subjected to an initial analysis to determine what level of similarity exists between pairs of features, measured using Pearson’s r correlation coefficient, before being used as inputs to a multiple regression model to determine their weighting and relative importance. The features are then used as the input to two machine learning approaches, regression modelling and artificial neural networks, in order to determine their ability to predict the emotional dimensions of arousal and valence. It was found that a small number of strong correlations exist between the features and that a greater number of features contribute significantly to the predictive power of emotional valence, rather than arousal. Shallow neural networks perform significantly better than a range of regression models and the best performing networks were able to account for 64.4% of the variance in prediction of arousal and 65.4% in the case of valence. These findings are a major improvement over those encountered in the literature. Several extensions of this research are discussed, including work related to improving data sets as well as the modelling processes.
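
The two statistics named in the abstract, Pearson's r between pairs of features and the proportion of variance accounted for by a model's predictions, can be illustrated with a minimal sketch. This is not the authors' code: the feature names, the toy values and the 0.8 threshold for a "strong" correlation are all assumptions made for illustration, and the helper names `pearson_r`, `strongly_correlated_pairs` and `r_squared` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson's r between two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strongly_correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson_r(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


def r_squared(observed, predicted):
    """Proportion of the variance in `observed` accounted for by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up toy values; these are NOT taken from the paper or the IADS set.
features = {
    "rms": [0.1, 0.2, 0.3, 0.4, 0.5],
    "peak": [0.2, 0.4, 0.6, 0.8, 1.0],  # exactly twice "rms", so r is 1
    "zcr": [0.5, 0.1, 0.4, 0.2, 0.3],
}
arousal_observed = [2.0, 3.0, 4.0, 5.0, 6.0]
arousal_predicted = [2.5, 3.0, 4.0, 5.0, 5.5]

pairs = strongly_correlated_pairs(features)
score = r_squared(arousal_observed, arousal_predicted)
print(pairs)
print(score)
```

Under this reading, a model that "accounts for 64.4% of the variance" would correspond to `r_squared` returning roughly 0.644 on the arousal ratings; how the paper computed the figure in detail is not stated in the abstract.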

Original language	English
Journal	Personal and Ubiquitous Computing
Volume	25
Issue number	4
Pages (from-to)	637-650
Number of pages	14
ISSN	1617-4909
DOI	https://doi.org/10.1007/s00779-020-01389-0
Status	Published - 2021
Externally published	Yes

Access to the document

10.1007/s00779-020-01389-0

Other files and links

http://www.scopus.com/inward/record.url?scp=85084087695&partnerID=8YFLogxK

Citation formats

@article{7ea0b3058d664815be06d33da4bb7da9,
title = "Supervised machine learning for audio emotion recognition: Enhancing film sound design using audio features, regression models and artificial neural networks",
abstract = "The field of Music Emotion Recognition has become an established research sub-domain of Music Information Retrieval. Less attention has been directed towards the counterpart domain of Audio Emotion Recognition, which focuses upon detection of emotional stimuli resulting from non-musical sound. By better understanding how sounds provoke emotional responses in an audience, it may be possible to enhance the work of sound designers. The work in this paper uses the International Affective Digital Sounds set. A total of 76 features are extracted from the sounds, spanning the time and frequency domains. The features are then subjected to an initial analysis to determine what level of similarity exists between pairs of features, measured using Pearson{\textquoteright}s r correlation coefficient, before being used as inputs to a multiple regression model to determine their weighting and relative importance. The features are then used as the input to two machine learning approaches, regression modelling and artificial neural networks, in order to determine their ability to predict the emotional dimensions of arousal and valence. It was found that a small number of strong correlations exist between the features and that a greater number of features contribute significantly to the predictive power of emotional valence, rather than arousal. Shallow neural networks perform significantly better than a range of regression models and the best performing networks were able to account for 64.4\% of the variance in prediction of arousal and 65.4\% in the case of valence. These findings are a major improvement over those encountered in the literature. Several extensions of this research are discussed, including work related to improving data sets as well as the modelling processes.",
keywords = "Affect, Arousal, Audio emotion recognition, Audio features, Emotion, IADS, Neural networks, Regression, Valence",
author = "Stuart Cunningham and Harrison Ridley and Weinel, {Jonathan Rex} and Richard Picking",
year = "2021",
doi = "10.1007/s00779-020-01389-0",
language = "English",
volume = "25",
pages = "637--650",
journal = "Personal and Ubiquitous Computing",
issn = "1617-4909",
publisher = "Springer",
number = "4",
}

Supervised machine learning for audio emotion recognition: Enhancing film sound design using audio features, regression models and artificial neural networks. / Cunningham, Stuart; Ridley, Harrison; Weinel, Jonathan Rex et al.
In: Personal and Ubiquitous Computing, Vol. 25, No. 4, 2021, p. 637-650.

TY - JOUR

T1 - Supervised machine learning for audio emotion recognition

T2 - Enhancing film sound design using audio features, regression models and artificial neural networks

AU - Cunningham, Stuart

AU - Ridley, Harrison

AU - Weinel, Jonathan Rex

AU - Picking, Richard

PY - 2021

Y1 - 2021

AB - The field of Music Emotion Recognition has become an established research sub-domain of Music Information Retrieval. Less attention has been directed towards the counterpart domain of Audio Emotion Recognition, which focuses upon detection of emotional stimuli resulting from non-musical sound. By better understanding how sounds provoke emotional responses in an audience, it may be possible to enhance the work of sound designers. The work in this paper uses the International Affective Digital Sounds set. A total of 76 features are extracted from the sounds, spanning the time and frequency domains. The features are then subjected to an initial analysis to determine what level of similarity exists between pairs of features, measured using Pearson’s r correlation coefficient, before being used as inputs to a multiple regression model to determine their weighting and relative importance. The features are then used as the input to two machine learning approaches, regression modelling and artificial neural networks, in order to determine their ability to predict the emotional dimensions of arousal and valence. It was found that a small number of strong correlations exist between the features and that a greater number of features contribute significantly to the predictive power of emotional valence, rather than arousal. Shallow neural networks perform significantly better than a range of regression models and the best performing networks were able to account for 64.4% of the variance in prediction of arousal and 65.4% in the case of valence. These findings are a major improvement over those encountered in the literature. Several extensions of this research are discussed, including work related to improving data sets as well as the modelling processes.

KW - Affect

KW - Arousal

KW - Audio emotion recognition

KW - Audio features

KW - Emotion

KW - IADS

KW - Neural networks

KW - Regression

KW - Valence

UR - http://www.scopus.com/inward/record.url?scp=85084087695&partnerID=8YFLogxK

U2 - 10.1007/s00779-020-01389-0

DO - 10.1007/s00779-020-01389-0

M3 - Journal article

SN - 1617-4909

VL - 25

SP - 637

EP - 650

JO - Personal and Ubiquitous Computing

JF - Personal and Ubiquitous Computing

IS - 4

ER -
