Document and Word-level Language Identification for Noisy User Generated Text

Zhanibek Kozhirbayev, Zhandos Yessenbayev, Aibek Makazhanov

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

We present our work on language identification for comments left by readers of online news sites popular in Kazakhstan. Such comments are typically written in one of the two languages widely spoken in the area (Kazakh and Russian) and sometimes in a mixture of both. Code-switching (mixing languages) makes it desirable to identify language not only at the document level but also at the level of individual words. We approach both tasks in a single two-step framework that performs unsupervised normalization followed by Naïve Bayes text classification. In addition, we apply a deep learning model based on recurrent networks with LSTM cells to classify text. Our results suggest an improvement over the state of the art for the Kazakh language.
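The two-step idea in the abstract (normalize, then classify with Naïve Bayes, at both word and document level) can be illustrated with a minimal sketch. This is not the authors' implementation: the character n-gram features, the add-one smoothing, the repeated-character squeezing used as a stand-in for normalization, the toy word lists, and all names (`TRAIN`, `normalize`, `NaiveBayesLangId`) are assumptions made for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

# Toy training words invented for illustration (NOT the paper's data).
TRAIN = {
    "kk": ["және", "қазақ", "тілі", "үшін", "өте", "жақсы", "бүгін", "қала", "әңгіме", "ұлт"],
    "ru": ["и", "русский", "язык", "для", "очень", "хорошо", "сегодня", "город", "это", "новости"],
}


def normalize(word):
    """Crude stand-in for normalization: lowercase, squeeze runs of 3+ identical characters."""
    return re.sub(r"(.)\1{2,}", r"\1", word.lower())


def char_ngrams(word, n_max=3):
    """Character 1..n_max-grams of the word, padded with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for n in range(1, n_max + 1) for i in range(len(padded) - n + 1)]


class NaiveBayesLangId:
    """Multinomial Naive Bayes over character n-grams with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> total n-gram count
        self.priors = Counter()             # language -> number of training words
        self.vocab = set()

    def fit(self, data):
        for lang, words in data.items():
            for w in words:
                self.priors[lang] += 1
                for g in char_ngrams(normalize(w)):
                    self.counts[lang][g] += 1
                    self.totals[lang] += 1
                    self.vocab.add(g)
        return self

    def log_scores(self, word):
        """Unnormalized log posterior of each language for a single word."""
        n_words = sum(self.priors.values())
        v = len(self.vocab)
        grams = char_ngrams(normalize(word))
        scores = {}
        for lang in self.priors:
            s = math.log(self.priors[lang] / n_words)
            for g in grams:
                s += math.log((self.counts[lang][g] + self.alpha)
                              / (self.totals[lang] + self.alpha * v))
            scores[lang] = s
        return scores

    def predict_word(self, word):
        scores = self.log_scores(word)
        return max(scores, key=scores.get)

    def tag_words(self, text):
        """Word-level identification: one label per token."""
        return [(w, self.predict_word(w)) for w in re.findall(r"\w+", text)]

    def predict_document(self, text):
        """Document-level identification: sum per-word log scores, pick the best language."""
        totals = Counter()
        for w in re.findall(r"\w+", text):
            for lang, s in self.log_scores(w).items():
                totals[lang] += s
        return max(totals, key=totals.get)


model = NaiveBayesLangId().fit(TRAIN)
print(model.tag_words("бүгін очень жақсы"))
print(model.predict_document("сегодня очень хорошо"))
```

Summing per-word log scores for the document label is one simple design choice; a majority vote over the word labels would be an equally plausible alternative for code-switched comments.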

Original language: English
Title of host publication: IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538664674
DOI: https://doi.org/10.1109/ICAICT.2018.8747138
Publication status: Published - Oct 1 2018
Event: 12th IEEE International Conference on Application of Information and Communication Technologies, AICT 2018 - Almaty, Kazakhstan
Duration: Oct 17 2018 to Oct 19 2018

Publication series

Name: IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings

Conference

Conference: 12th IEEE International Conference on Application of Information and Communication Technologies, AICT 2018
Country: Kazakhstan
City: Almaty
Period: 10/17/18 to 10/19/18

Fingerprint

Language
Language Arts
Kazakhstan
Learning
Deep learning

Keywords

  • code-switching
  • language identification
  • normalization
  • user generated content

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Information Systems and Management
  • Health Informatics
  • Information Systems

Cite this

Kozhirbayev, Z., Yessenbayev, Z., & Makazhanov, A. (2018). Document and Word-level Language Identification for Noisy User Generated Text. In IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings [8747138] (IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICAICT.2018.8747138

@inproceedings{c41cf13c55804fe58252a1cbc22c1283,
title = "Document and Word-level Language Identification for Noisy User Generated Text",
abstract = "We present our work on language identification for comments left by readers of online news sites popular in Kazakhstan. Such comments are typically written in one of the two languages widely spoken in the area (Kazakh and Russian) and sometimes in a mixture of both. Code-switching (mixing languages) makes it desirable to identify language not only at the document level but also at the level of individual words. We approach both tasks in a single two-step framework that performs unsupervised normalization followed by Na{\"i}ve Bayes text classification. In addition, we apply a deep learning model based on recurrent networks with LSTM cells to classify text. Our results suggest an improvement over the state of the art for the Kazakh language.",
keywords = "code-switching, language identification, normalization, user generated content",
author = "Zhanibek Kozhirbayev and Zhandos Yessenbayev and Aibek Makazhanov",
year = "2018",
month = "10",
day = "1",
doi = "10.1109/ICAICT.2018.8747138",
language = "English",
series = "IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings",
address = "United States",

}

TY - GEN
T1 - Document and Word-level Language Identification for Noisy User Generated Text
AU - Kozhirbayev, Zhanibek
AU - Yessenbayev, Zhandos
AU - Makazhanov, Aibek
PY - 2018/10/1
Y1 - 2018/10/1
AB - We present our work on language identification for comments left by readers of online news sites popular in Kazakhstan. Such comments are typically written in one of the two languages widely spoken in the area (Kazakh and Russian) and sometimes in a mixture of both. Code-switching (mixing languages) makes it desirable to identify language not only at the document level but also at the level of individual words. We approach both tasks in a single two-step framework that performs unsupervised normalization followed by Naïve Bayes text classification. In addition, we apply a deep learning model based on recurrent networks with LSTM cells to classify text. Our results suggest an improvement over the state of the art for the Kazakh language.
KW - code-switching
KW - language identification
KW - normalization
KW - user generated content
UR - http://www.scopus.com/inward/record.url?scp=85070232678&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85070232678&partnerID=8YFLogxK
U2 - 10.1109/ICAICT.2018.8747138
DO - 10.1109/ICAICT.2018.8747138
M3 - Conference contribution
AN - SCOPUS:85070232678
T3 - IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
BT - IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
ER -