Kazakh text normalization using machine translation approaches

Zhanibek Kozhirbayev, Zhandos Yessenbayev

    Research output: Contribution to journal > Conference article > peer-review

    Abstract

    We present our work on text normalization applied to user-generated content (UGC) in the Kazakh language, collected from the Kazakhstani segment of the Internet. UGC is notoriously difficult to process because of the rapid introduction of neologisms, peculiar spelling, code-switching, and transliteration. All of this increases lexical variety, thereby aggravating the most prominent problems of NLP, such as out-of-vocabulary words and data sparseness. It has been shown that a preprocessing step, known as lexical normalization or simply normalization, is required for NLP tools to work properly on such text. We applied machine translation techniques to normalize Kazakh texts. For this, a parallel corpus was created from a set of aligned sentences in canonical and non-canonical forms. Using this corpus, we built a phrase-based statistical machine translation system as a baseline. Furthermore, we applied a word-based sequence-to-sequence model to the normalization task. The former method scores 21.67 BLEU on the test set, whereas the latter obtains approximately 30 BLEU.
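Because both systems are compared by BLEU, a short illustration of how that score is computed may help. The following is a minimal sketch in plain Python, assuming one reference per hypothesis, whitespace-tokenized input, uniform weights over 1- to 4-grams, and no smoothing. The function names and the placeholder tokens are hypothetical; this is not the scoring tool used by the authors, which the abstract does not name.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Uniform n-gram weights and no smoothing: the score is 0.0 as soon as
    any n-gram order has no match at all.
    """
    matched = [0] * max_n  # clipped n-gram matches, per order
    total = [0] * max_n    # n-grams proposed by the system, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Counter intersection keeps the minimum count, i.e. clipping.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty punishes output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


# Placeholder tokens stand in for a canonical (normalized) sentence.
reference = [["w1", "w2", "w3", "w4", "w5"]]
# A hypothetical system output that gets the last token wrong.
hypothesis = [["w1", "w2", "w3", "w4", "x"]]
print(corpus_bleu(hypothesis, reference))
```

In the normalization setting, each hypothesis would be a system's normalized output for a non-canonical sentence and each reference its canonical form from the parallel corpus. An output identical to the reference scores 100, so a higher BLEU means the output is closer to the canonical text.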

    Original language: English
    Pages (from-to): 115-122
    Number of pages: 8
    Journal: CEUR Workshop Proceedings
    Volume: 2780
    Publication status: Published - 2020
    Event: 2020 Computational Models in Language and Speech Workshop, CMLS 2020 - Kazan, Russian Federation
    Duration: Nov 12 2020 → Nov 13 2020

    Keywords

    • Sequence-sequence model
    • Text normalization
    • User-generated content

    ASJC Scopus subject areas

    • General Computer Science
