TY - JOUR
T1 - Extended language modeling experiments for Kazakh
AU - Myrzakhmetov, Bagdat
AU - Kozhirbayev, Zhanibek
N1 - Funding Information:
This work has been funded by Nazarbayev University under research grant No. 129-2017/022-2017 and by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under research grant AP05134272.
Publisher Copyright:
© 2018 CEUR-WS. All rights reserved.
PY - 2018/1/1
Y1 - 2018/1/1
N2 - In this article we present a dataset for language modeling in Kazakh. It is an analogue of the Penn Treebank dataset for the Kazakh language, as we followed all of its construction guidelines. The main source for our dataset is articles on web pages that were originally written in Kazakh, since many articles in Kazakhstan are translations into Kazakh. The dataset is publicly available for research purposes. Several experiments were conducted with this dataset. Along with traditional n-gram models, we built neural network models for word-based language modeling (LM). The latter model, based on a long short-term memory (LSTM) network with a large number of parameters, shows the best performance. Since Kazakh is considered an agglutinative language and may have a high out-of-vocabulary (OOV) rate on unseen datasets, we also carried out morph-based LM experiments. The experimental results show that sub-word-based LMs fit Kazakh better than word-based LMs in both n-gram and neural network models.
AB - In this article we present a dataset for language modeling in Kazakh. It is an analogue of the Penn Treebank dataset for the Kazakh language, as we followed all of its construction guidelines. The main source for our dataset is articles on web pages that were originally written in Kazakh, since many articles in Kazakhstan are translations into Kazakh. The dataset is publicly available for research purposes. Several experiments were conducted with this dataset. Along with traditional n-gram models, we built neural network models for word-based language modeling (LM). The latter model, based on a long short-term memory (LSTM) network with a large number of parameters, shows the best performance. Since Kazakh is considered an agglutinative language and may have a high out-of-vocabulary (OOV) rate on unseen datasets, we also carried out morph-based LM experiments. The experimental results show that sub-word-based LMs fit Kazakh better than word-based LMs in both n-gram and neural network models.
KW - Kazakh language
KW - Language modeling
KW - Morph-based models
KW - N-gram
KW - Neural language models
UR - http://www.scopus.com/inward/record.url?scp=85060616033&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85060616033&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85060616033
SN - 1613-0073
VL - 2303
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - 2018 International Workshop on Computational Models in Language and Speech, CMLS 2018
Y2 - 1 November 2018
ER -