On Various Approaches to Machine Translation from Russian to Kazakh

Aibek Makazhanov, Bagdat Myrzakhmetov, Zhanibek Kozhirbayev

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In this work we compare a number of approaches to machine translation (MT) from Russian to Kazakh. We focus specifically on this pair of languages for a number of reasons. First, these languages are relatively understudied in terms of MT research, as well as natural language processing (NLP) research in general. Kazakh, in particular, has been actively studied with modern methods for less than a decade. Second, this pair of languages poses several processing challenges rooted in their nature: both languages are morphologically complex and tend to have free constituent order, which makes long-term dependencies rather frequent. From the perspective of data-driven approaches to NLP, this means increased data sparseness and high out-of-vocabulary (OOV) rates. Lastly, apart from scientific curiosity, there is a strong practical demand for high-quality MT between the languages in question. Kazakh is the state language of Kazakhstan, while Russian, due to a strong Soviet heritage, largely remains a language of professional communication and conduct. This frequently results in paperwork being initially prepared in Russian and then translated into Kazakh. Thus, high-quality MT systems are in demand, as they would greatly reduce the manual labor of professional translators. We categorize the approaches that we compare into data-driven, linguistically motivated, and hybrid ones. In the first category we compare a phrase-based statistical MT (SMT) approach and a neural MT (NMT) approach. For the latter we experiment with three different neural architectures. As a result of this comparison we conclude that, while NMT is a promising research direction, one needs a lot more computational resources and, perhaps, even more data to achieve the level of accuracy offered by SMT.
As for linguistically motivated and hybrid approaches, we compare a rule-based approach with a so-called factored model, which is essentially an SMT model that takes into account various linguistic factors, such as parts of speech, lemmata, and morphology. Although this comparison has shown that factored models should be strongly favored, we must note that the Russian-Kazakh pair for the rule-based system used in the experiment is still a work in progress. Lastly, a final comparison between the best-performing models from each category, i.e. a purely data-driven SMT model and a hybrid factored model, has favored the former. While we acknowledge that the present work makes no significant contribution to NLP research in general, we want to point out that, to the best of our knowledge, experiments on NMT and factored SMT have never been performed before for the particular language pair considered herein. We speculate that one possible reason for this is the absence of an accessible Russian-Kazakh parallel corpus that is suitable for those experiments in terms of both size and quality. With this in mind, we also provide a detailed description of the parallel data set that we used for our experiments and that we plan to make available in the future.
Original language: Undefined/Unknown
Title of host publication: Proceedings of the 5th International Conference on Turkic Languages Processing (TurkLang 2017)
Place of Publication: Kazan, Tatarstan
Publication status: Published - 2017

Cite this

Makazhanov, A., Myrzakhmetov, B., & Kozhirbayev, Z. (2017). On Various Approaches to Machine Translation from Russian to Kazakh. In Proceedings of the 5th International Conference on Turkic Languages Processing (TurkLang 2017). Kazan, Tatarstan.

