Initial Normalization of User Generated Content: Case Study in a Multilingual Setting

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We address the problem of normalizing user generated content in a multilingual setting. Specifically, we target the comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh, in Russian, or in a mixture of both. Moreover, such comments are noisy, i.e. difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.

Original language: English
Title of host publication: IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538664674
DOIs: https://doi.org/10.1109/ICAICT.2018.8747161
Publication status: Published - Oct 1 2018
Event: 12th IEEE International Conference on Application of Information and Communication Technologies, AICT 2018 - Almaty, Kazakhstan
Duration: Oct 17 2018 to Oct 19 2018

Publication series

Name: IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings

Conference

Conference: 12th IEEE International Conference on Application of Information and Communication Technologies, AICT 2018
Country: Kazakhstan
City: Almaty
Period: 10/17/18 to 10/19/18

Fingerprint

Internet
Language
User-generated content
Normalization
World Wide Web
Breach
Due process
News
Sentiment analysis

Keywords

  • code switching
  • normalization
  • transliteration
  • user generated content

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Information Systems and Management
  • Health Informatics
  • Information Systems

Cite this

Myrzakhmetov, B., Yessenbayev, Z., & Makazhanov, A. (2018). Initial Normalization of User Generated Content: Case Study in a Multilingual Setting. In IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings [8747161] (IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICAICT.2018.8747161

@inproceedings{c5b1c994eb954f469cf02d07e67d2573,
title = "Initial Normalization of User Generated Content: Case Study in a Multilingual Setting",
abstract = "We address the problem of normalizing user generated content in a multilingual setting. Specifically, we target the comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh, in Russian, or in a mixture of both. Moreover, such comments are noisy, i.e. difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.",
keywords = "code switching, normalization, transliteration, user generated content",
author = "Bagdat Myrzakhmetov and Zhandos Yessenbayev and Aibek Makazhanov",
year = "2018",
month = "10",
day = "1",
doi = "10.1109/ICAICT.2018.8747161",
language = "English",
series = "IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings",
address = "United States",

}

TY - GEN
T1 - Initial Normalization of User Generated Content
T2 - Case Study in a Multilingual Setting
AU - Myrzakhmetov, Bagdat
AU - Yessenbayev, Zhandos
AU - Makazhanov, Aibek
PY - 2018/10/1
Y1 - 2018/10/1
AB - We address the problem of normalizing user generated content in a multilingual setting. Specifically, we target the comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh, in Russian, or in a mixture of both. Moreover, such comments are noisy, i.e. difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.
KW - code switching
KW - normalization
KW - transliteration
KW - user generated content
UR - http://www.scopus.com/inward/record.url?scp=85070207528&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85070207528&partnerID=8YFLogxK
U2 - 10.1109/ICAICT.2018.8747161
DO - 10.1109/ICAICT.2018.8747161
M3 - Conference contribution
AN - SCOPUS:85070207528
T3 - IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
BT - IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
ER -