Kazakh Text Normalization using Machine Translation Approaches

Research output: Contribution to journal › Conference article › peer-review

Abstract

We present our work on text normalization applied to user-generated content (UGC) in the Kazakh language, collected from the Kazakhstani segment of the Internet. UGC is notoriously difficult to process because of the rapid introduction of neologisms, peculiar spelling, code-switching, and transliteration. All of this increases lexical variety, thereby aggravating the most prominent problems of NLP, such as out-of-vocabulary words and data sparseness. It has been shown that a preprocessing step, known as lexical normalization or simply normalization, is required for NLP tools to work properly on such text.
We applied machine translation techniques to normalize Kazakh texts. For this, we created a parallel corpus of aligned sentences in canonical and non-canonical forms. Using this corpus, we built a phrase-based statistical machine translation system as a baseline. Furthermore, we applied a word-based sequence-to-sequence model to the normalization task. The former method achieves a BLEU score of 21.67 on the test set, whereas the latter obtains a BLEU score of approximately 30.
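
To make the reported metric concrete, the following is a minimal sketch of how a corpus-level BLEU score could be computed when normalization is framed as translation from non-canonical to canonical text. It is not the authors' evaluation script: the function names `ngrams` and `corpus_bleu` are hypothetical, the example sentences are invented English placeholders rather than corpus data, and a single reference per sentence with whitespace tokenization is assumed. Published scores are normally produced with established evaluation tools, whose tokenization and smoothing may differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    No smoothing is applied: if any n-gram order has zero matches,
    the score is 0.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped matches: each hypothesis n-gram is credited at most
            # as often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for output shorter than the reference.
    if hyp_len > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Invented placeholder data: system output vs. canonical reference.
system_output = ["see you tomorrow at the station".split()]
canonical = ["see you tomorrow at the airport".split()]
score = corpus_bleu(system_output, canonical)
print(f"BLEU = {score:.2f}")
```

In this framing, the hypothesis is the normalized output of the system and the reference is the canonical side of the parallel corpus, so a higher score means the output shares more n-grams with the canonical form.
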
Original language: English
Article number: 10
Journal: CEUR Workshop Proceedings
Volume: 2780
Publication status: Published - 2020

Keywords

  • Text normalization
  • User-generated content
  • Sequence-sequence model
