Hand-crafted versus learned representations for audio event detection

Selver Ezgi Küçükbay, Adnan Yazıcı, Sinan Kalkan

Research output: Contribution to journal › Article › peer-review

Abstract

Audio Event Detection (AED) is the task of identifying the types of events in audio signals. AED is essential for applications that make decisions based on audio signals, and these decisions can be critical, for example in health, surveillance and security applications. Despite the proven benefits of deep learning in obtaining the best representation for solving a problem, AED studies still generally employ hand-crafted representations, even when deep learning is used to solve the AED task. Intrigued by this, we investigate whether hand-crafted representations (i.e., spectrogram, mel spectrogram, log mel spectrogram and mel frequency cepstral coefficients) are better than a representation learned using a Convolutional Autoencoder (CAE). To the best of our knowledge, our study is the first to ask this question and to thoroughly compare feature representations for AED. To this end, we first find the best hop size and window size for each hand-crafted representation, and then compare the optimized hand-crafted representations with CAE-learned representations. Our extensive analyses on a subset of the AudioSet dataset confirm the common practice: hand-crafted representations perform better than learned features by a large margin (∼30 AP). Moreover, we show that the commonly used window and hop sizes do not yield the best performance for the hand-crafted representations.
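
As an illustration of the two parameters tuned in the study, the following minimal sketch computes a magnitude spectrogram and shows how the window size and hop size determine its time and frequency resolution. It is not the authors' pipeline: the function name `magnitude_spectrogram`, the Hann window, the synthetic 440 Hz test tone, and the two (window, hop) settings are all assumptions chosen for illustration, and only NumPy is used.

```python
import numpy as np


def magnitude_spectrogram(signal, window_size, hop_size):
    """Short-time Fourier magnitude spectrogram with a Hann window.

    Hypothetical helper for illustration only.
    Returns an array of shape (n_frames, window_size // 2 + 1).
    """
    if len(signal) < window_size:
        raise ValueError("signal is shorter than one window")
    # Number of full windows that fit when stepping by hop_size samples
    n_frames = 1 + (len(signal) - window_size) // hop_size
    window = np.hanning(window_size)
    frames = np.stack([
        signal[i * hop_size : i * hop_size + window_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequency bins of each real frame
    return np.abs(np.fft.rfft(frames, axis=1))


# Assumed input: one second of a 440 Hz tone sampled at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

# Illustrative settings, not the values reported in the paper
for window_size, hop_size in [(400, 160), (1024, 512)]:
    spec = magnitude_spectrogram(tone, window_size, hop_size)
    print(f"window={window_size} hop={hop_size} -> shape {spec.shape}")
```

A larger window gives more frequency bins per frame, and a larger hop gives fewer frames per second of audio, so the same clip yields differently shaped inputs for a downstream classifier. The mel, log mel and MFCC representations named in the abstract are derived from a spectrogram of this kind and inherit the same two parameters.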

Original language: English
Journal: Multimedia Tools and Applications
Publication status: Accepted/In press - 2022

Keywords

  • Audio event classification
  • Audio event detection
  • Deep learning
  • Log mel spectrogram
  • Mel spectrogram
  • MFCC
  • Spectrogram

ASJC Scopus subject areas

  • Software
  • Media Technology
  • Hardware and Architecture
  • Computer Networks and Communications
