Assembling the Kazakh language corpus

Olzhas Makhambetov, Aibek Makazhanov, Zhandos Yessenbayev, Bakhyt Matkarimov, Islam Sabyrgaliyev, Anuar Sharafudinov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

17 Citations (Scopus)

Abstract

This paper presents the Kazakh Language Corpus (KLC), one of the first attempts within a local research community to assemble a Kazakh corpus. KLC is designed to be a large-scale corpus containing over 135 million words and covering five stylistic genres: literary, publicistic, official, scientific, and informal. Along with its primary part, KLC comprises (i) an annotated sub-corpus of segmented documents encoded in the eXtensible Markup Language (XML), which marks the complete morphological, syntactic, and structural characteristics of the texts; and (ii) a sub-corpus of annotated speech data. KLC has a web-based corpus management system that helps users navigate the data and retrieve the information they need. KLC is also open to contributors who are willing to make suggestions, donate texts, and help annotate existing materials.

Original language: English
Title of host publication: EMNLP 2013 - 2013 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 1022-1031
Number of pages: 10
ISBN (Electronic): 9781937284978
Publication status: Published - 2013
Event: 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013 - Seattle, United States
Duration: Oct 18 2013 to Oct 21 2013

Publication series

Name: EMNLP 2013 - 2013 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Other

Other: 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013
Country: United States
City: Seattle
Period: 10/18/13 to 10/21/13

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Information Systems
  • Computer Vision and Pattern Recognition
