Abstract
One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance degrades when training and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch issue for the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances tend to be much shorter for short-video spontaneous speech (∼3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (∼10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, we observe that the proposed RUC method significantly improves long-utterance recognition without a performance drop on short utterances. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improved robustness to varying utterance lengths.
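The abstract does not spell out how utterances are selected or joined, so the following is only a minimal sketch of what on-the-fly random utterance concatenation could look like, not the authors' implementation. The `Utterance` type, the function name, and the `max_concat` and `prob` parameters are all hypothetical and chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    samples: List[float]  # waveform samples (hypothetical representation)
    text: str             # human transcript


def random_utterance_concat(
    batch: List[Utterance],
    max_concat: int = 3,   # assumed upper bound on utterances joined together
    prob: float = 0.5,     # assumed probability of augmenting each utterance
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Return a batch of the same size where some items are concatenations."""
    if rng is None:
        rng = random.Random()
    upper = min(max_concat, len(batch))
    out: List[Utterance] = []
    for i, utt in enumerate(batch):
        # Keep the original when augmentation is skipped or impossible.
        if upper < 2 or rng.random() >= prob:
            out.append(utt)
            continue
        k = rng.randint(2, upper)  # total number of utterances to join
        others = [u for j, u in enumerate(batch) if j != i]
        parts = [utt] + rng.sample(others, k - 1)
        # Join audio end to end and transcripts with a single space.
        samples = [s for p in parts for s in p.samples]
        text = " ".join(p.text for p in parts)
        out.append(Utterance(samples, text))
    return out
```

In a training pipeline, a function like this would typically be called inside the data loader on every batch, so that different combinations are seen in each epoch and some longer training examples are produced without storing additional audio.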
| Original language | English |
|---|---|
| Pages (from-to) | 904-908 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Volume | 2023-August |
| DOI | |
| Publication status | Published - 2023 |
| Externally published | Yes |
| Event | 24th International Speech Communication Association, Interspeech 2023 - Dublin. Duration: Aug 20 2023 → Aug 24 2023 |
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modelling and Simulation
Fingerprint
Research topics of "Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition". Together they form a unique semantic fingerprint.