TY - JOUR
T1 - Kazakh text normalization using machine translation approaches
AU - Kozhirbayev, Zhanibek
AU - Yessenbayev, Zhandos
N1 - Funding Information:
This work has been funded by the Ministry of Education and Science of the Republic of Kazakhstan under the research grants No. AP05134272 and No. AP08053085.
Publisher Copyright:
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2020
Y1 - 2020
AB - We present our work on text normalization applied to user-generated content (UGC) in the Kazakh language, collected from the Kazakhstani segment of the Internet. UGC is notoriously difficult to process because of the rapid introduction of neologisms, peculiar spelling, code-switching, and transliteration. All of this increases lexical variety, thereby aggravating the most prominent problems of NLP, such as out-of-vocabulary words and data sparseness. It has been shown that a preprocessing step, known as lexical normalization or simply normalization, is required for NLP tools to work properly on such text. We applied machine translation techniques to normalize Kazakh texts. For this, we created a parallel corpus of aligned sentences in canonical and non-canonical forms. Using these sentences, we built a phrase-based statistical machine translation system as a baseline. Furthermore, we applied a word-based sequence-to-sequence model to the normalization task. The former method achieves a BLEU score of 21.67 on the test set, whereas the latter obtains a BLEU score of approximately 30.
KW - Sequence-sequence model
KW - Text normalization
KW - User-generated content
UR - http://www.scopus.com/inward/record.url?scp=85098220535&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098220535&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85098220535
SN - 1613-0073
VL - 2780
SP - 115
EP - 122
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - 2020 Computational Models in Language and Speech Workshop, CMLS 2020
Y2 - 12 November 2020 through 13 November 2020
ER -