Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

  • Yist Y. Lin
  • Tao Han
  • Haihua Xu
  • Van Tung Pham
  • Yerbolat Khassanov
  • Tze Yuang Chong
  • Yi He
  • Lu Lu
  • Zejun Ma

Research output: Contribution to journal › Conference article › peer-review

Abstract

One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance degrades when training and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch in the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances for short-video spontaneous speech tend to be much shorter (∼3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (∼10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, the proposed RUC method significantly improves long-utterance recognition without degrading performance on short utterances. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improved robustness to varying utterance lengths.
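
The abstract does not give implementation details, so the following is only a minimal sketch of what on-the-fly random utterance concatenation could look like, not the authors' implementation. It assumes each training utterance is a (waveform samples, transcript) pair and that partners are drawn from the same batch; the function name and the `max_concat` and `concat_prob` parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Assumed representation: (waveform samples, transcript text).
Utterance = Tuple[List[float], str]


def random_utterance_concat(
    batch: Sequence[Utterance],
    max_concat: int = 3,        # hypothetical: max utterances joined into one
    concat_prob: float = 0.5,   # hypothetical: chance of augmenting an utterance
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Return a new batch in which some utterances are lengthened by
    appending randomly chosen utterances from the same batch."""
    rng = rng or random.Random()
    augmented: List[Utterance] = []
    for wav, text in batch:
        new_wav = list(wav)  # copy so the input batch is left untouched
        new_text = text
        if max_concat >= 2 and rng.random() < concat_prob:
            # Append between 1 and max_concat - 1 extra utterances.
            n_extra = rng.randint(1, max_concat - 1)
            for _ in range(n_extra):
                other_wav, other_text = rng.choice(batch)
                new_wav.extend(other_wav)
                new_text = f"{new_text} {other_text}"
        augmented.append((new_wav, new_text))
    return augmented
```

Under these assumptions, joining up to three utterances of about 3 seconds each would bring training examples closer to the roughly 10-second test utterances described above, while leaving some examples unchanged keeps short utterances in the training mix.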

Original language: English
Pages (from-to): 904-908
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
Publication status: Published - 2023
Externally published: Yes
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: Aug 20 2023 to Aug 24 2023

Keywords

  • data augmentation
  • end-to-end
  • random utterance concatenation
  • short video
  • speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
