TY - JOUR
T1 - On the Effect of Log-Mel Spectrogram Parameter Tuning for Deep Learning-Based Speech Emotion Recognition
AU - Mukhamediya, Azamat
AU - Fazli, Siamac
AU - Zollanvari, Amin
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2023
Y1 - 2023
AB - Speech emotion recognition (SER) has become a major area of investigation in human-computer interaction. Conventionally, SER is formulated as a classification problem that follows a common methodology: (i) extracting features from speech signals; and (ii) constructing an emotion classifier using the extracted features. With the advent of deep learning, however, the former stage is integrated into the latter. That is to say, deep neural networks (DNNs), which are trained using log-Mel spectrograms (LMS) of audio waveforms, extract discriminative features from LMS. A critical issue, and one that is often overlooked, is that this procedure is done without relating the choice of LMS parameters to the performance of the trained DNN classifiers. It is commonplace in SER studies that practitioners assume some 'usual' values for these parameters and devote major efforts to training and comparing various DNN architectures. In contrast with this common approach, in this work we choose a single lightweight pre-trained architecture, namely, SqueezeNet, and shift our main effort to tuning LMS parameters. Our empirical results using three publicly available SER datasets show that: (i) parameters of LMS can considerably affect the performance of DNNs; and (ii) by tuning LMS parameters, highly competitive classification performance can be achieved. In particular, treating LMS parameters as hyperparameters and tuning them led to 23%, 10%, and 11% improvement over the use of 'usual' values of LMS parameters in the EmoDB, IEMOCAP, and SAVEE datasets, respectively.
KW - Log-Mel spectrogram
KW - speech emotion recognition
KW - SqueezeNet
UR - http://www.scopus.com/inward/record.url?scp=85162658157&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85162658157&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3287093
DO - 10.1109/ACCESS.2023.3287093
M3 - Article
AN - SCOPUS:85162658157
SN - 2169-3536
VL - 11
SP - 61950
EP - 61957
JO - IEEE Access
JF - IEEE Access
ER -