TY - GEN
T1 - Leveraging Wav2Vec2.0 for Kazakh Speech Recognition
T2 - 24th International Conference on Computational Science and Its Applications, ICCSA 2024
AU - Kozhirbayev, Zhanibek
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
PY - 2024
Y1 - 2024
N2 - In the fast-growing world of neural networks, models trained on extensive multilingual text and speech data have shown great promise for improving the state of low-resource languages. This study focuses on the application of state-of-the-art speech recognition models, specifically Facebook’s Wav2Vec2.0 and Wav2Vec2-XLSR, to the Kazakh language. The primary objective is to evaluate the performance of these models in transcribing spoken Kazakh content. Additionally, the research explores the possibility of using data from other languages for initial training and examines whether fine-tuning the model with target-language data can improve its performance. Moreover, this work provides insights into how effective pre-trained multilingual models are when applied to low-resource languages. The fine-tuned Wav2Vec2-XLSR model demonstrated impressive results, achieving a character error rate (CER) of 1.9 and a word error rate (WER) of 8.9 on the test set of the Kazcorpus dataset. These findings may help build robust Automatic Speech Recognition (ASR) systems for Kazakh, which could be used for various applications such as voice-activated assistants and speech-to-text translators.
AB - In the fast-growing world of neural networks, models trained on extensive multilingual text and speech data have shown great promise for improving the state of low-resource languages. This study focuses on the application of state-of-the-art speech recognition models, specifically Facebook’s Wav2Vec2.0 and Wav2Vec2-XLSR, to the Kazakh language. The primary objective is to evaluate the performance of these models in transcribing spoken Kazakh content. Additionally, the research explores the possibility of using data from other languages for initial training and examines whether fine-tuning the model with target-language data can improve its performance. Moreover, this work provides insights into how effective pre-trained multilingual models are when applied to low-resource languages. The fine-tuned Wav2Vec2-XLSR model demonstrated impressive results, achieving a character error rate (CER) of 1.9 and a word error rate (WER) of 8.9 on the test set of the Kazcorpus dataset. These findings may help build robust Automatic Speech Recognition (ASR) systems for Kazakh, which could be used for various applications such as voice-activated assistants and speech-to-text translators.
KW - Automatic speech recognition
KW - Kazakh language
KW - Pre-trained transformer models
KW - Speech representation models
KW - Wav2Vec 2.0
KW - Wav2Vec2-XLSR
UR - http://www.scopus.com/inward/record.url?scp=85200660529&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85200660529&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-64608-9_8
DO - 10.1007/978-3-031-64608-9_8
M3 - Conference contribution
AN - SCOPUS:85200660529
SN - 9783031646072
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 120
EP - 132
BT - Computational Science and Its Applications - ICCSA 2024 - 24th International Conference, 2024, Proceedings
A2 - Gervasi, Osvaldo
A2 - Murgante, Beniamino
A2 - Garau, Chiara
A2 - Taniar, David
A2 - C. Rocha, Ana Maria A.
A2 - Faginas Lago, Maria Noelia
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 1 July 2024 through 4 July 2024
ER -