KazNLP: A Pipeline for Automated Processing of Texts Written in Kazakh Language

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

We present the current results of our ongoing work on developing tools and algorithms for processing the Kazakh language within the KazNLP project. The project is motivated by the need for accessible, easy-to-use, cross-platform, and well-documented automated text processing tools for Kazakh, particularly for user-generated text, which contains transliteration, code switching, and other artifacts of language-specific raw data that require pre-processing. Thus, apart from a basic tokenization-tagging-parsing pipeline and downstream applications such as named entity recognition and spell checking, KazNLP offers pre-processing tools such as text normalization and language identification. All of the KazNLP tools are released under the Creative Commons license. Since detailed descriptions of the methods and algorithms used in KazNLP have been published or are to be published in various venues, references to which are given in the corresponding sections, this work provides only an overview of the tools and their performance.

Original language: English
Title of host publication: Speech and Computer - 22nd International Conference, SPECOM 2020, Proceedings
Editors: Alexey Karpov, Rodmonga Potapova
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 657-666
Number of pages: 10
ISBN (Print): 9783030602758
Publication status: Published - 2020
Event: 22nd International Conference on Speech and Computer, SPECOM 2020 - St. Petersburg, Russian Federation
Duration: Oct 7 2020 to Oct 9 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12335 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 22nd International Conference on Speech and Computer, SPECOM 2020
Country: Russian Federation
City: St. Petersburg
Period: 10/7/20 to 10/9/20

Keywords

  • Computational linguistics
  • Corpus linguistics
  • Kazakh language
  • Natural language processing
  • Programming tools

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
