Identification of the parallel documents from multilingual news websites

Bagdat Myrzakhmetov, Aitolkyn Sultangazina, Aibek Makazhanov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

We present the initial results of our experiments on document alignment for the online news domain. Specifically, as opposed to cross-site comparable news alignment, we focus on the identification of parallel documents from within the same multilingual websites. In such a setting, parallel news stories often turn out to be direct translations of each other, and they tend to share common media and to be published close together in time. We leverage this domain-specific property of the data and propose a straightforward yet competitive heuristic that performs on par with a machine learning-based method in terms of precision, and outperforms a widely used bitext extraction system on a range of metrics. Moreover, this heuristic has allowed us to identify comparable documents overlooked by a human annotator. Although both the rule-based and the learning-based methods that we present are language independent, we specifically focus on the Russian-Kazakh language pair, as the present study is one of the initial steps towards a greater objective of building a corresponding parallel corpus and a machine translation system.
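The abstract does not spell out the heuristic, so the following is only a minimal sketch of the general idea it describes (shared media plus proximity in publication date), not the authors' implementation. The `Article` type, the `align` function, the `max_days` threshold, and the use of media URLs as the shared-media signal are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    # Hypothetical record type; the field names are illustrative only.
    doc_id: str
    lang: str
    published: date
    media: frozenset  # URLs of images or other media embedded in the story


def align(source_docs, target_docs, max_days=1):
    """Pair each source article with the target article that shares the
    most media items, considering only targets published within
    `max_days` of the source. `max_days` is an assumed threshold."""
    pairs = []
    for src in source_docs:
        best, best_overlap = None, 0
        for tgt in target_docs:
            # Skip candidates whose publication dates are too far apart.
            if abs((src.published - tgt.published).days) > max_days:
                continue
            overlap = len(src.media & tgt.media)
            # Require at least one shared media item; the first best wins ties.
            if overlap > best_overlap:
                best, best_overlap = tgt, overlap
        if best is not None:
            pairs.append((src.doc_id, best.doc_id))
    return pairs
```

Under these assumptions, a Russian story and a Kazakh story that embed the same photograph and appear within a day of each other would be paired, while stories with no shared media are left unaligned.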

Original language: English
Title of host publication: Application of Information and Communication Technologies, AICT 2016 - Conference Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509018406
Publication status: Published - Jul 25 2017
Event: 10th IEEE International Conference on Application of Information and Communication Technologies, AICT 2016 - Baku, Azerbaijan
Duration: Oct 12 2016 → Oct 14 2016

Conference

Conference: 10th IEEE International Conference on Application of Information and Communication Technologies, AICT 2016
Country: Azerbaijan
City: Baku
Period: 10/12/16 → 10/14/16

Keywords

  • Document alignment
  • machine translation
  • parallel corpus

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Computer Science Applications
  • Computer Networks and Communications
  • Information Systems
  • Modelling and Simulation
