Kazakh Text Normalization using Machine Translation Approaches

Zhanibek Kozhirbayev, Zhandos Yessenbayev

    Research output: Contribution to journal; Conference article; peer-review

    Abstract

    We present our work on text normalization applied to user-generated content (UGC) in the Kazakh language, collected from the Kazakhstani segment of the Internet. UGC is notoriously difficult to process because of the rapid introduction of neologisms, peculiar spelling, code-switching, and transliteration. All of this increases lexical variety, thereby aggravating the most prominent problems of NLP, such as out-of-vocabulary words and data sparseness. It has been shown that certain preprocessing, known as lexical normalization or simply normalization, is required for NLP tools to work properly on such text.
    We applied machine translation techniques to normalize Kazakh texts. For this, a parallel corpus was created with a set of aligned sentences in canonical and non-canonical forms. Using these aligned comments, we built a phrase-based statistical machine translation system as a baseline. Furthermore, we applied a word-based sequence-to-sequence model to the normalization task. The former method achieves a BLEU score of 21.67 on the test set, whereas the latter obtains a BLEU score of approximately 30.
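The BLEU scores reported above measure n-gram overlap between a system's normalized output and the canonical reference. As a rough illustration only (this is not the authors' evaluation script, whose tokenization and settings are not stated here), a minimal corpus-level BLEU with one reference per sentence might be sketched as follows. The function names and the toy token strings are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one whitespace-tokenized reference per hypothesis."""
    matched = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n     # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += sum(h_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


# Toy placeholder tokens, not data from the paper.
score = corpus_bleu(["a b c d e"], ["a b c d f"])
```

An output identical to its reference scores 100, and an output sharing no 4-gram with the reference scores 0 under this unsmoothed variant. Published scores usually come from an established tool, so numbers from a sketch like this are not directly comparable with the figures above.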
    Original language: English
    Article number: 10
    Journal: CEUR Workshop Proceedings
    Volume: 2780
    Publication status: Published - 2020

    Keywords

    • Text normalization
    • User-generated content
    • Sequence-to-sequence model
