Multimedia Corpus of Modern Spoken Kazakh Language

Project: CRP

Project Details

Grant Program

Collaborative Research Program 2021-2023

Project Description

The purpose of the proposed project is to fill an important empirical gap in the research infrastructure for the study of contemporary Kazakh by building a state-of-the-art multimedia corpus of modern spoken Kazakh, with a total volume of at least 10,000,000 words of collected data, at least 1,000,000 words of transcribed data, and at least 200,000 words of fully annotated natural speech.
This resource, with its considerable scientific and applied potential, will provide a crucial missing component alongside existing resources such as the Kazakh National Corpus (of written language), substantially increasing the overall applied value of the resulting data system for further research, education, and industry.
The overarching research theme of the project is language diversity, contact, variation, and change in Kazakhstan within the regional context of Central Eurasia: documenting the present state of the language and understanding the historical factors that shaped the cultural and linguistic composition of Eurasia.
The primary output of the proposed project will be the multimedia corpus of modern spoken Kazakh, a database designed to serve contemporary research needs. Multimedia language materials (recordings of naturally occurring spoken discourse) are prioritized in this project as the most representative record of culturally specific and contextually conditioned communication patterns: audio and video capture not only the traditional linguistic modality but also wider multimodal aspects of communication. By the end of the project, the recorded primary data (spontaneous discourse, diverse in genre and register) is expected to receive full interlinearized annotation and free translation. This deeply annotated language data will be integrated, in diverse multimedia formats and using state-of-the-art software tools, into a powerful research infrastructure with wide applied value in academic research, education, industry, and policymaking, among other areas.
The project is novel and significant in that no previous or concurrent project, nationally or regionally, combines state-of-the-art theoretical, methodological, and technological approaches in this way. It will produce a high-quality empirical base for diverse cross-disciplinary applications, including studies in sociolinguistics, corpus linguistics, geolinguistics, and anthropological linguistics; complex interdisciplinary studies in human history; modern educational programs; and industrial applications in Kazakh speech recognition and speech synthesis and human-machine interaction (Kazakh-language computer applications), with significant potential in AI projects.
Short title: Kazakh Spoken Corpus
Acronym: MultiCorSKL
Status: Active
Effective start/end date: 1/1/21 to 12/31/24

Keywords

  • Kazakh language
  • spoken language
  • language corpora
  • multimedia data
  • database
  • corpus linguistics
  • linguistics
  • natural language processing
  • language documentation
  • Turkic languages
  • morphologizers
