Identification of the parallel documents from multilingual news websites

Bagdat Myrzakhmetov, Aitolkyn Sultangazina, Aibek Makazhanov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

We present the initial results of our experiments on document alignment for the online news domain. Specifically, as opposed to cross-site comparable news alignment, we focus on the identification of parallel documents from within the same multilingual websites. In such a setting, parallel news stories oftentimes turn out to be direct translations of each other, with a tendency to share common media and to be published close together in time. We leverage this domain-specific property of the data and propose a straightforward yet competitive heuristic that performs on par with a machine learning-based method in terms of precision, and outperforms a widely used bitext extraction system on a range of metrics. Moreover, this heuristic has allowed us to identify comparable documents overlooked by a human annotator. Although both the rule-based and the learning-based methods that we present are language independent, we specifically focus on the Russian-Kazakh language pair, as the present study is one of the initial steps towards the greater objective of building a corresponding parallel corpus and a machine translation system.
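The abstract does not spell out the heuristic itself, so the following is only a minimal sketch of the general idea it describes: pairing documents from the two language versions of one site when they share embedded media and were published close together. The `Doc` type, the `align` function, the greedy matching strategy, and the one-day `max_days` window are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    """Hypothetical record for one news story (not from the paper)."""
    doc_id: str
    published: date
    media: frozenset  # URLs of images or videos embedded in the story


def align(ru_docs, kk_docs, max_days=1):
    """Greedily pair each Russian story with the unused Kazakh story
    that shares the most media and lies within max_days of it."""
    pairs = []
    used = set()
    for ru in ru_docs:
        best, best_overlap = None, 0
        for kk in kk_docs:
            if kk.doc_id in used:
                continue
            # Skip candidates published too far apart (assumed window).
            if abs((ru.published - kk.published).days) > max_days:
                continue
            overlap = len(ru.media & kk.media)
            if overlap > best_overlap:
                best, best_overlap = kk, overlap
        # Only accept a pair when at least one media item is shared.
        if best is not None:
            used.add(best.doc_id)
            pairs.append((ru.doc_id, best.doc_id))
    return pairs
```

A rule of this kind needs no training data and no bilingual lexicon, which is consistent with the abstract's claim of language independence; the price is that stories without shared media are never paired.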

Original language: English
Title of host publication: Application of Information and Communication Technologies, AICT 2016 - Conference Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509018406
DOI: https://doi.org/10.1109/ICAICT.2016.7991684
Publication status: Published - Jul 25 2017
Event: 10th IEEE International Conference on Application of Information and Communication Technologies, AICT 2016 - Baku, Azerbaijan
Duration: Oct 12 2016 to Oct 14 2016

Conference

Conference: 10th IEEE International Conference on Application of Information and Communication Technologies, AICT 2016
Country: Azerbaijan
City: Baku
Period: 10/12/16 to 10/14/16

Keywords

  • Document alignment
  • machine translation
  • parallel corpus

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Computer Science Applications
  • Computer Networks and Communications
  • Information Systems
  • Modelling and Simulation

Cite this

Myrzakhmetov, B., Sultangazina, A., & Makazhanov, A. (2017). Identification of the parallel documents from multilingual news websites. In Application of Information and Communication Technologies, AICT 2016 - Conference Proceedings [7991684]. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICAICT.2016.7991684

@inproceedings{126c7cfdb3084835a9fb59cef3505506,
title = "Identification of the parallel documents from multilingual news websites",
abstract = "We present the initial results of our experiments on document alignment for the online news domain. Specifically, as opposed to cross-site comparable news alignment, we focus on the identification of parallel documents from within the same multilingual websites. In such a setting, parallel news stories oftentimes turn out to be direct translations of each other, with a tendency to share common media and to be published close together in time. We leverage this domain-specific property of the data and propose a straightforward yet competitive heuristic that performs on par with a machine learning-based method in terms of precision, and outperforms a widely used bitext extraction system on a range of metrics. Moreover, this heuristic has allowed us to identify comparable documents overlooked by a human annotator. Although both the rule-based and the learning-based methods that we present are language independent, we specifically focus on the Russian-Kazakh language pair, as the present study is one of the initial steps towards the greater objective of building a corresponding parallel corpus and a machine translation system.",
keywords = "Document alignment, machine translation, parallel corpus",
author = "Bagdat Myrzakhmetov and Aitolkyn Sultangazina and Aibek Makazhanov",
year = "2017",
month = "7",
day = "25",
doi = "10.1109/ICAICT.2016.7991684",
language = "English",
booktitle = "Application of Information and Communication Technologies, AICT 2016 - Conference Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States",

}

TY - GEN

T1 - Identification of the parallel documents from multilingual news websites

AU - Myrzakhmetov, Bagdat

AU - Sultangazina, Aitolkyn

AU - Makazhanov, Aibek

PY - 2017/7/25

Y1 - 2017/7/25

AB - We present the initial results of our experiments on document alignment for the online news domain. Specifically, as opposed to cross-site comparable news alignment, we focus on the identification of parallel documents from within the same multilingual websites. In such a setting, parallel news stories oftentimes turn out to be direct translations of each other, with a tendency to share common media and to be published close together in time. We leverage this domain-specific property of the data and propose a straightforward yet competitive heuristic that performs on par with a machine learning-based method in terms of precision, and outperforms a widely used bitext extraction system on a range of metrics. Moreover, this heuristic has allowed us to identify comparable documents overlooked by a human annotator. Although both the rule-based and the learning-based methods that we present are language independent, we specifically focus on the Russian-Kazakh language pair, as the present study is one of the initial steps towards the greater objective of building a corresponding parallel corpus and a machine translation system.

KW - Document alignment

KW - machine translation

KW - parallel corpus

UR - http://www.scopus.com/inward/record.url?scp=85034235250&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85034235250&partnerID=8YFLogxK

U2 - 10.1109/ICAICT.2016.7991684

DO - 10.1109/ICAICT.2016.7991684

M3 - Conference contribution

AN - SCOPUS:85034235250

BT - Application of Information and Communication Technologies, AICT 2016 - Conference Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

ER -