TY - GEN
T1 - Initial Normalization of User Generated Content
T2 - 12th IEEE International Conference on Application of Information and Communication Technologies, AICT 2018
AU - Myrzakhmetov, Bagdat
AU - Yessenbayev, Zhandos
AU - Makazhanov, Aibek
N1 - Funding Information:
This work has been supported by Nazarbayev University research grant 144-2018//010-2018 and the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under the research grant AP05134272.
PY - 2018/10/1
Y1 - 2018/10/1
N2 - We address the problem of normalizing user generated content in a multilingual setting. Specifically, we target comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh or Russian, or in a mixture of both. Moreover, such comments are noisy, i.e., difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.
AB - We address the problem of normalizing user generated content in a multilingual setting. Specifically, we target comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh or Russian, or in a mixture of both. Moreover, such comments are noisy, i.e., difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.
KW - code switching
KW - normalization
KW - transliteration
KW - user generated content
UR - http://www.scopus.com/inward/record.url?scp=85070207528&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85070207528&partnerID=8YFLogxK
U2 - 10.1109/ICAICT.2018.8747161
DO - 10.1109/ICAICT.2018.8747161
M3 - Conference contribution
AN - SCOPUS:85070207528
T3 - IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
BT - IEEE 12th International Conference on Application of Information and Communication Technologies, AICT 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 October 2018 through 19 October 2018
ER -