TY - GEN
T1 - Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning
AU - Zeng, Zhiping
AU - Pham, Van Tung
AU - Xu, Haihua
AU - Khassanov, Yerbolat
AU - Chng, Eng Siong
AU - Ni, Chongjia
AU - Ma, Bin
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/24
Y1 - 2021/1/24
N2 - In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under a cross-lingual transfer learning setting. To this end, we extend the prior work [1] and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to its LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus, which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained using the limited labeled data. Starting from this, we obtain a further 25.4% relative WER reduction by transfer learning from another resource-rich language. Moreover, we obtain an additional 13.6% relative WER reduction by boosting the LSTM decoder of the transferred model with the extra text data. Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER. Last but not least, the proposed hybrid architecture offers much faster inference compared to both LSTM and Transformer architectures.
AB - In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under a cross-lingual transfer learning setting. To this end, we extend the prior work [1] and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to its LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus, which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained using the limited labeled data. Starting from this, we obtain a further 25.4% relative WER reduction by transfer learning from another resource-rich language. Moreover, we obtain an additional 13.6% relative WER reduction by boosting the LSTM decoder of the transferred model with the extra text data. Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER. Last but not least, the proposed hybrid architecture offers much faster inference compared to both LSTM and Transformer architectures.
KW - cross-lingual transfer learning
KW - independent language model
KW - LSTM
KW - Transformer
KW - unpaired text
UR - http://www.scopus.com/inward/record.url?scp=85102574283&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85102574283&partnerID=8YFLogxK
U2 - 10.1109/ISCSLP49672.2021.9362086
DO - 10.1109/ISCSLP49672.2021.9362086
M3 - Conference contribution
AN - SCOPUS:85102574283
T3 - 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
BT - 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
Y2 - 24 January 2021 through 27 January 2021
ER -