TY - JOUR
T1 - Hand-crafted versus learned representations for audio event detection
AU - Küçükbay, Selver Ezgi
AU - Yazıcı, Adnan
AU - Kalkan, Sinan
N1 - Funding Information:
We would like to thank Türk Telekom Research Center for providing hardware components for the experiments. Dr. Kalkan is supported by the BAGEP Award of the Science Academy, Turkey.
Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2022
Y1 - 2022
N2 - Audio Event Detection (AED) pertains to identifying the types of events in audio signals. AED is essential for applications that make decisions based on audio signals, which can be critical in, for example, health, surveillance, and security settings. Despite the proven benefits of deep learning in obtaining the best representation for solving a problem, AED studies still generally employ hand-crafted representations, even when deep learning is used for solving the AED task itself. Intrigued by this, we investigate whether or not hand-crafted representations (i.e., spectrogram, mel spectrogram, log mel spectrogram, and mel frequency cepstral coefficients) are better than a representation learned using a Convolutional Autoencoder (CAE). To the best of our knowledge, our study is the first to ask this question and to thoroughly compare feature representations for AED. To this end, we first find the best hop size and window size for each hand-crafted representation and then compare the optimized hand-crafted representations with the CAE-learned representations. Our extensive analyses on a subset of the AudioSet dataset confirm the common practice: hand-crafted representations do perform better than learned features, by a large margin (∼30 AP). Moreover, we show that the commonly used window and hop sizes do not yield optimal performance for the hand-crafted representations.
AB - Audio Event Detection (AED) pertains to identifying the types of events in audio signals. AED is essential for applications that make decisions based on audio signals, which can be critical in, for example, health, surveillance, and security settings. Despite the proven benefits of deep learning in obtaining the best representation for solving a problem, AED studies still generally employ hand-crafted representations, even when deep learning is used for solving the AED task itself. Intrigued by this, we investigate whether or not hand-crafted representations (i.e., spectrogram, mel spectrogram, log mel spectrogram, and mel frequency cepstral coefficients) are better than a representation learned using a Convolutional Autoencoder (CAE). To the best of our knowledge, our study is the first to ask this question and to thoroughly compare feature representations for AED. To this end, we first find the best hop size and window size for each hand-crafted representation and then compare the optimized hand-crafted representations with the CAE-learned representations. Our extensive analyses on a subset of the AudioSet dataset confirm the common practice: hand-crafted representations do perform better than learned features, by a large margin (∼30 AP). Moreover, we show that the commonly used window and hop sizes do not yield optimal performance for the hand-crafted representations.
KW - Audio event classification
KW - Audio event detection
KW - Deep learning
KW - Log mel spectrogram
KW - Mel spectrogram
KW - MFCC
KW - Spectrogram
UR - http://www.scopus.com/inward/record.url?scp=85127720320&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127720320&partnerID=8YFLogxK
U2 - 10.1007/s11042-022-12873-5
DO - 10.1007/s11042-022-12873-5
M3 - Article
AN - SCOPUS:85127720320
SN - 1380-7501
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
ER -