Extended language modeling experiments for Kazakh

Bagdat Myrzakhmetov, Zhanibek Kozhirbayev

Research output: Contribution to journal › Conference article › peer-review

4 Citations (Scopus)


In this article we present a language modeling dataset for Kazakh. It is an analogue of the Penn Treebank dataset, as we followed all of the instructions used to create it. The main source of our dataset is web articles that were originally written in Kazakh, since many new articles in Kazakhstan are translated into Kazakh. The dataset is publicly available for research purposes. Several experiments were conducted with this dataset. Together with traditional n-gram models, we created neural network models for the word-based language model (LM). The latter, based on a large parameterized long short-term memory (LSTM) network, shows the best performance. Since Kazakh is an agglutinative language and may have a high out-of-vocabulary (OOV) rate on unseen datasets, we also carried out morph-based language modeling. According to the experimental results, the sub-word based LM fits Kazakh better than the word-based LM in both n-gram and neural network models.
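To illustrate why sub-word units can lower the OOV rate for an agglutinative language, here is a minimal sketch. It is not the authors' pipeline: the word list, the `SEGMENTS` table and the `+` suffix marker are toy assumptions made up for this example, and a real experiment would use an actual morphological segmenter and the published dataset.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Hypothetical hand-written segmentation table (assumption, for illustration only).
SEGMENTS = {
    "үй": ["үй"],
    "үйлер": ["үй", "+лер"],
    "үйде": ["үй", "+де"],
    "үйлерде": ["үй", "+лер", "+де"],
}


def segment(words):
    """Replace each word by its sub-word units; unknown words stay whole."""
    return [unit for word in words for unit in SEGMENTS.get(word, [word])]


# Toy corpora (assumption): the test set contains a word form unseen in training.
train_words = ["үй", "үйлер", "үйде"]
test_words = ["үйлерде", "үй"]

word_oov = oov_rate(train_words, test_words)
morph_oov = oov_rate(segment(train_words), segment(test_words))

print(f"word-level OOV rate:  {word_oov:.2f}")
print(f"morph-level OOV rate: {morph_oov:.2f}")
```

In this toy setting the unseen word form is built entirely from units that did occur in training, so it counts as OOV at the word level but not at the sub-word level.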

Original language: English
Journal: CEUR Workshop Proceedings
Publication status: Published - Jan 1 2018
Event: 2018 International Workshop on Computational Models in Language and Speech, CMLS 2018 - Kazan, Russian Federation
Duration: Nov 1 2018 → …


  • Kazakh language
  • Language modeling
  • Morph-based models
  • N-gram
  • Neural language models

ASJC Scopus subject areas

  • General Computer Science

