Building a Kazakh Dependency Treebank

A dependency treebank (hereinafter, treebank) is a collection of texts (a corpus) annotated for morphology and syntax according to an annotation scheme (a set of labels for marking linguistic phenomena) based on a particular implementation of dependency grammar. Building a treebank entails designing an annotation scheme, collecting electronic texts of various genres, segmenting the text into sentences and syntactic words, annotating the text, and validating the annotation. Treebanks are invaluable sources of data for general (empirical) and computational linguistic research, as well as for practical applications such as machine translation, voice and text search, sentiment analysis, opinion mining, and other text and speech processing tasks.
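To make the notions of annotation and validation concrete, here is a minimal sketch of how one annotated sentence could be represented and checked for well-formedness. It is an illustration only, not the project's actual data format or tooling: the `Token` fields, the `validate` function, the tag and relation labels, and the English toy sentence are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """One syntactic word with its labels (hypothetical field set)."""
    index: int   # 1-based position in the sentence
    form: str    # surface form as it appears in the text
    lemma: str   # dictionary form
    pos: str     # part-of-speech tag (illustrative tag set)
    head: int    # index of the governing token; 0 marks the root
    deprel: str  # dependency relation label (illustrative label set)


def validate(sentence: List[Token]) -> List[str]:
    """Return a list of problems; an empty list means a well-formed tree."""
    problems = []
    n = len(sentence)

    for position, tok in enumerate(sentence, start=1):
        if tok.index != position:
            problems.append(f"token {tok.index}: expected index {position}")
        if not 0 <= tok.head <= n:
            problems.append(f"token {tok.index}: head {tok.head} out of range")
        if tok.head == tok.index:
            problems.append(f"token {tok.index}: token is its own head")

    # A dependency tree has exactly one root.
    roots = [tok for tok in sentence if tok.head == 0]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")

    # Following heads upward from any token must not revisit a token.
    heads = {tok.index: tok.head for tok in sentence}
    for tok in sentence:
        seen = set()
        current = tok.index
        while current != 0 and current in heads:
            if current in seen:
                problems.append(f"token {tok.index}: head chain contains a cycle")
                break
            seen.add(current)
            current = heads[current]

    return problems


# Toy example (not taken from the treebank): "The dog barks"
good = [
    Token(1, "The", "the", "DET", 2, "det"),
    Token(2, "dog", "dog", "NOUN", 3, "nsubj"),
    Token(3, "barks", "bark", "VERB", 0, "root"),
]

# Same sentence with a broken annotation: no root, tokens 2 and 3 head each other.
bad = [
    Token(1, "The", "the", "DET", 2, "det"),
    Token(2, "dog", "dog", "NOUN", 3, "nsubj"),
    Token(3, "barks", "bark", "VERB", 2, "root"),
]

print(validate(good))
print(validate(bad))
```

Checks of this kind cover only structural well-formedness; whether a label is linguistically correct still has to be judged by a human annotator or a scheme-specific validator.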
The project aims to build a large treebank of Kazakh. Decent-sized treebanks are currently available for only a handful of the world’s languages, and the project implementation team believes that Kazakh can and should be among those fortunate few. Building a sufficiently large treebank without basic text processing tools, such as tokenizers and morphological and syntactic parsers, is virtually impossible. Thus, as a means of building the treebank, the project has the additional goal of developing these basic tools, as well as real-world applications (e.g., machine translation systems).

The project is on schedule, and in two years of implementation the following has been accomplished:
- an annotation scheme compatible with international standards has been developed;
- about 63% of the treebank (470 thousand of an expected minimum of 750 thousand words) is complete, i.e., the texts have been split into sentences and syntactic words, annotated, and checked;
- prototypes of the basic text processing tools, namely text tokenizers, morphological analyzers and taggers, and a syntactic parser, have been built and evaluated;
- a machine translation prototype has been implemented (

A total of 11 research papers have been published (4 have been accepted for publication and will appear soon), of which:
- 2 are in the proceedings of top-tier conferences* (indexed by Web of Science and Scopus, no impact factor);
- 7 are in the proceedings of second- and third-tier conferences (3 indexed by Scopus, no impact factor);
- 2 are in international journals (no impact factor).
Effective start/end date: 1/1/16 to 12/31/18