Extended language modeling experiments for Kazakh

Bagdat Myrzakhmetov, Zhanibek Kozhirbayev

Research output: Contribution to journal › Conference article

Abstract

In this article we present a dataset for the Kazakh language for language modeling. It is an analogue of the Penn Treebank dataset for Kazakh, as we followed the same instructions to create it. The main source for our dataset is articles from web pages that were originally written in Kazakh, since many new articles in Kazakhstan are translations into Kazakh. The dataset is publicly available for research purposes. Several experiments were conducted with this dataset. Together with traditional n-gram models, we created neural network models for the word-based language model (LM). The latter, based on a large parameterized long short-term memory (LSTM) network, shows the best performance. Since Kazakh is an agglutinative language and may have a high out-of-vocabulary (OOV) rate on unseen datasets, we also carried out morph-based LM experiments. According to the experimental results, sub-word-based LMs fit Kazakh well in both n-gram and neural network models compared to word-based LMs.

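The paper's code is not part of this record, but the abstract's central claim, namely that sub-word (morph-based) units reduce the out-of-vocabulary rate that word-level models suffer from in an agglutinative language such as Kazakh, can be illustrated with a small self-contained sketch. The add-one-smoothed bigram model, the naive suffix splitter, and the toy corpus below are hypothetical stand-ins, not the n-gram toolkit, LSTM models, or morphological segmenter used by the authors.

```python
# A minimal sketch (not the authors' pipeline): a bigram LM with add-one
# smoothing, trained once on word tokens and once on crude sub-word tokens,
# to illustrate why sub-word units lower the OOV rate for an agglutinative
# language such as Kazakh. The suffix splitter and the toy corpus are
# hypothetical stand-ins for a real morphological segmenter and real data.
import math
from collections import Counter


def word_tokens(line):
    """Plain whitespace tokenization."""
    return line.split()


def subword_tokens(line, max_stem=5):
    """Naive segmentation: keep a short 'stem' and split off the rest as a
    'suffix' piece marked with '@@' so the word could be reassembled."""
    pieces = []
    for w in line.split():
        if len(w) <= max_stem:
            pieces.append(w)
        else:
            pieces.append(w[:max_stem] + "@@")
            pieces.append(w[max_stem:])
    return pieces


def train_bigram(corpus, tokenize):
    """Count unigrams and bigrams over sentence-delimited token streams."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        toks = ["<s>"] + tokenize(line) + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams


def evaluate(corpus, tokenize, unigrams, bigrams):
    """Return (perplexity, OOV rate) of the add-one-smoothed bigram model."""
    vocab = set(unigrams)
    V = len(vocab) + 1                      # +1 for the <unk> symbol
    log_prob, n_transitions = 0.0, 0
    n_tokens, n_oov = 0, 0
    for line in corpus:
        raw = tokenize(line)
        n_tokens += len(raw)
        n_oov += sum(1 for t in raw if t not in vocab)
        toks = ["<s>"] + [t if t in vocab else "<unk>" for t in raw] + ["</s>"]
        for prev, cur in zip(toks, toks[1:]):
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
            log_prob += math.log(p)
            n_transitions += 1
    return math.exp(-log_prob / n_transitions), n_oov / n_tokens


# Toy corpus: a handful of Kazakh sentences standing in for the real dataset.
train = [
    "қазақ тілі үшін тілдік модель құрылды",
    "мақалалар интернеттен жиналды",
]
test = ["қазақ тіліндегі мақалалар жиналды"]

for name, tok in [("word", word_tokens), ("sub-word", subword_tokens)]:
    uni, bi = train_bigram(train, tok)
    ppl, oov_rate = evaluate(test, tok, uni, bi)
    print(f"{name:9s} perplexity={ppl:8.1f}  OOV rate={oov_rate:.2f}")
```

In a real setup the segmentation step would come from an unsupervised morphological segmenter or a BPE-style sub-word model rather than a fixed-length split, and perplexities of word- and sub-word-level models are only comparable after accounting for the differing token counts.
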
Original language: English
Journal: CEUR Workshop Proceedings
Volume: 2303
Publication status: Published - Jan 1 2018
Event: 2018 International Workshop on Computational Models in Language and Speech, CMLS 2018 - Kazan, Russian Federation
Duration: Nov 1 2018 → …

Keywords

  • Kazakh language
  • Language modeling
  • Morph-based models
  • N-gram
  • Neural language models

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Extended language modeling experiments for Kazakh. / Myrzakhmetov, Bagdat; Kozhirbayev, Zhanibek.

In: CEUR Workshop Proceedings, Vol. 2303, 01.01.2018.

Research output: Contribution to journal › Conference article

@article{1f982a8492b94d8c930c49f83c21d797,
title = "Extended language modeling experiments for Kazakh",
abstract = "In this article we present a dataset for the Kazakh language for language modeling. It is an analogue of the Penn Treebank dataset for Kazakh, as we followed the same instructions to create it. The main source for our dataset is articles from web pages that were originally written in Kazakh, since many new articles in Kazakhstan are translations into Kazakh. The dataset is publicly available for research purposes. Several experiments were conducted with this dataset. Together with traditional n-gram models, we created neural network models for the word-based language model (LM). The latter, based on a large parameterized long short-term memory (LSTM) network, shows the best performance. Since Kazakh is an agglutinative language and may have a high out-of-vocabulary (OOV) rate on unseen datasets, we also carried out morph-based LM experiments. According to the experimental results, sub-word-based LMs fit Kazakh well in both n-gram and neural network models compared to word-based LMs.",
keywords = "Kazakh language, Language modeling, Morph-based models, N-gram, Neural language models",
author = "Bagdat Myrzakhmetov and Zhanibek Kozhirbayev",
year = "2018",
month = "1",
day = "1",
language = "English",
volume = "2303",
journal = "CEUR Workshop Proceedings",
issn = "1613-0073",
publisher = "CEUR-WS",

}
