Initial Normalization of User Generated Content: Case Study in a Multilingual Setting

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

We address the problem of normalizing user-generated content in a multilingual setting. Specifically, we target the comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh, in Russian, or in a mixture of both. Moreover, such comments are noisy, i.e., difficult to process because of (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.

Original language: English
Title of host publication: IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538664674
DOI: https://doi.org/10.1109/ICAICT.2018.8747161
Publication status: Published - Oct 1 2018
Event: 12th IEEE International Conference on Application of Information and Communication Technologies, AICT 2018 - Almaty, Kazakhstan
Duration: Oct 17 2018 to Oct 19 2018

Publication series

Name: IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings

Conference

Conference: 12th IEEE International Conference on Application of Information and Communication Technologies, AICT 2018
Country: Kazakhstan
City: Almaty
Period: 10/17/18 to 10/19/18

Keywords

  • code switching
  • normalization
  • transliteration
  • user generated content

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Information Systems and Management
  • Health Informatics
  • Information Systems

Cite this

Myrzakhmetov, B., Yessenbayev, Z., & Makazhanov, A. (2018). Initial Normalization of User Generated Content: Case Study in a Multilingual Setting. In IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings [8747161] (IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICAICT.2018.8747161