TY - GEN
T1 - Document and Word-level Language Identification for Noisy User Generated Text
AU - Kozhirbayev, Zhanibek
AU - Yessenbayev, Zhandos
AU - Makazhanov, Aibek
N1 - Funding Information:
This work has been supported by Nazarbayev University research grant 129-2017/022-2017 and the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under the research grant AP05134272.
PY - 2018/10
Y1 - 2018/10
N2 - We present our work on language identification applied to comments left by readers of online news sites popular in Kazakhstan. Such comments are typically written in one of the two languages widely spoken in the area (Kazakh and Russian) and sometimes in a mixture of both. Code-switching (mixing languages) makes it desirable to identify the language not only at the document level but also at the level of individual words. We approach both tasks in a single two-step framework, performing unsupervised normalization followed by Naive Bayes text classification. In addition, we applied a deep learning model based on recurrent networks with LSTM cells to classify text. Our results suggest an improvement over the state of the art for the Kazakh language.
AB - We present our work on language identification applied to comments left by readers of online news sites popular in Kazakhstan. Such comments are typically written in one of the two languages widely spoken in the area (Kazakh and Russian) and sometimes in a mixture of both. Code-switching (mixing languages) makes it desirable to identify the language not only at the document level but also at the level of individual words. We approach both tasks in a single two-step framework, performing unsupervised normalization followed by Naive Bayes text classification. In addition, we applied a deep learning model based on recurrent networks with LSTM cells to classify text. Our results suggest an improvement over the state of the art for the Kazakh language.
KW - code-switching
KW - language identification
KW - normalization
KW - user generated content
UR - http://www.scopus.com/inward/record.url?scp=85070232678&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85070232678&partnerID=8YFLogxK
U2 - 10.1109/ICAICT.2018.8747138
DO - 10.1109/ICAICT.2018.8747138
M3 - Conference contribution
AN - SCOPUS:85070232678
T3 - IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
BT - IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 12th IEEE International Conference on Application of Information and Communication Technologies, AICT 2018
Y2 - 17 October 2018 through 19 October 2018
ER -