
Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

  • Yist Y. Lin
  • Tao Han
  • Haihua Xu
  • Van Tung Pham
  • Yerbolat Khassanov
  • Tze Yuang Chong
  • Yi He
  • Lu Lu
  • Zejun Ma
  • ByteDance Ltd.

Research output › peer-review

1 Citation (Scopus)

Abstract

One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance degrades when training and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch in the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances of short-video spontaneous speech tend to be much shorter (∼3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (∼10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, we observe that the proposed RUC method significantly improves long-utterance recognition without a performance drop on short utterances. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improved robustness to various utterance lengths.
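
The abstract does not spell out how the concatenation is carried out, so the following is only a minimal sketch of what an on-the-fly random utterance concatenation step might look like. The function name `random_utterance_concat`, the parameters `max_concat` and `concat_prob`, the choice to draw partner utterances from the same batch, and the space-joined transcripts are all illustrative assumptions, not details taken from the paper.

```python
import random
from typing import List, Optional, Tuple

# An utterance is a (waveform samples, transcript) pair. Plain lists keep
# the sketch dependency-free; a real pipeline would likely use tensors.
Utterance = Tuple[List[float], str]


def random_utterance_concat(
    batch: List[Utterance],
    max_concat: int = 3,        # hypothetical: max utterances joined into one
    concat_prob: float = 0.5,   # hypothetical: chance an utterance is extended
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Return a batch in which some utterances are lengthened by appending
    randomly chosen utterances from the same batch (assumed scheme)."""
    if max_concat < 2:
        raise ValueError("max_concat must be at least 2")
    rng = rng or random.Random()
    augmented: List[Utterance] = []
    for wave, text in batch:
        if len(batch) > 1 and rng.random() < concat_prob:
            # Append between 1 and max_concat - 1 extra utterances.
            n_extra = rng.randint(1, max_concat - 1)
            # Sampling is with replacement, so an utterance may be
            # paired with itself or with the same partner twice.
            for extra_wave, extra_text in rng.choices(batch, k=n_extra):
                wave = wave + extra_wave          # new list; input untouched
                text = f"{text} {extra_text}"     # join transcripts with a space
        augmented.append((wave, text))
    return augmented
```

Because the step runs per batch at training time, each epoch would see different concatenations of the same short utterances, which is one way to expose the model to longer inputs without storing extra audio.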

Original language: English
Pages (from-to): 904-908
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
DOI
Publication status: Published - 2023
Externally published: Yes
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin
Duration: Aug 20 2023 to Aug 24 2023

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
