Kazakh text normalization using machine translation approaches

Zhanibek Kozhirbayev, Zhandos Yessenbayev

Research output: Contribution to journal › Conference article › peer-review

Abstract

We present our work on text normalization applied to user-generated content (UGC) in the Kazakh language, collected from the Kazakhstani segment of the Internet. UGC is notoriously difficult to process because of the rapid introduction of neologisms, peculiar spelling, code-switching, and transliteration. All of this increases lexical variety, thereby aggravating the most prominent problems of NLP, such as out-of-vocabulary words and data sparseness. It has been shown that certain preprocessing, known as lexical normalization or simply normalization, is required for NLP tools to work properly on such text. We applied machine translation techniques to normalize Kazakh texts. For this, we created a parallel corpus of aligned sentences in canonical and non-canonical forms. Using this corpus, we built a phrase-based statistical machine translation system as a baseline. Furthermore, we applied a word-based sequence-to-sequence model to the normalization task. The former method achieves a BLEU score of 21.67 on the test set, whereas the latter obtains a BLEU score of approximately 30.
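
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of corpus-level BLEU on a 0-100 scale, assuming one reference per hypothesis, whitespace-tokenized input, n-grams up to order 4, and no smoothing. The abstract does not state which BLEU implementation, tokenization, or smoothing the authors used, so this is illustrative only; the function names `ngrams` and `corpus_bleu` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU (0-100), one reference per hypothesis.

    Assumption: inputs are lists of token lists. This is a sketch of the
    standard definition, not the evaluation script used in the paper.
    """
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # A hypothesis n-gram is credited at most as often as it
            # occurs in the reference (clipping).
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Toy example with placeholder tokens (not data from the paper).
normalized = [["a", "b", "c", "d", "e"]]
canonical = [["a", "b", "c", "d", "f"]]
print(round(corpus_bleu(normalized, canonical), 2))
```

In a normalization setting, the hypotheses would be the system's normalized sentences and the references the canonical side of the parallel corpus. Scores from different BLEU implementations are not directly comparable, so a sketch like this should not be used to reproduce the reported figures.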

Original language: English
Pages (from-to): 115-122
Number of pages: 8
Journal: CEUR Workshop Proceedings
Volume: 2780
Publication status: Published - 2020
Externally published: Yes
Event: 2020 Computational Models in Language and Speech Workshop, CMLS 2020 - Kazan, Russian Federation
Duration: Nov 12 2020 to Nov 13 2020

Keywords

  • Sequence-sequence model
  • Text normalization
  • User-generated content

ASJC Scopus subject areas

  • Computer Science (all)
