Abstract
We present the initial results of our experiments on document alignment for the online news domain. Specifically, as opposed to cross-site comparable news alignment, we focus on identifying parallel documents within the same multilingual websites. In such a setting, parallel news stories often turn out to be direct translations of each other, tending to share common media and to be published close together in time. We leverage this domain-specific property of the data and propose a straightforward yet competitive heuristic that performs on par with a machine learning-based method in terms of precision and outperforms a widely used bitext extraction system on a range of metrics. Moreover, this heuristic has allowed us to identify comparable documents overlooked by a human annotator. Although both the rule- and learning-based methods that we present are language independent, we specifically focus on the Russian-Kazakh language pair, as the present study is one of the initial steps towards a greater objective of building a corresponding parallel corpus and a machine translation system.
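The abstract does not spell out the heuristic itself, so the following is only a minimal sketch of how the two signals it names (shared media and proximity in publication date) might be combined. The `Doc` record, the `align` function, the one-day `max_days` window, and the sample file names are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    """Hypothetical record for one news story crawled from a multilingual site."""
    doc_id: str
    lang: str
    published: date
    media: frozenset  # identifiers (e.g. URLs) of images/videos embedded in the story


def align(src_docs, tgt_docs, max_days=1):
    """Pair each source document with the target document that shares the most
    media items, considering only targets published within max_days of it.
    The max_days threshold is an assumed value, not taken from the paper."""
    pairs = []
    for s in src_docs:
        best, best_overlap = None, 0
        for t in tgt_docs:
            # Skip candidates whose publication dates are too far apart.
            if abs((s.published - t.published).days) > max_days:
                continue
            overlap = len(s.media & t.media)
            if overlap > best_overlap:
                best, best_overlap = t, overlap
        # Emit a pair only when at least one media item is shared.
        if best is not None:
            pairs.append((s.doc_id, best.doc_id))
    return pairs


# Toy data: two Russian and two Kazakh stories (made-up identifiers).
ru = [
    Doc("ru1", "ru", date(2016, 5, 1), frozenset({"a.jpg", "b.jpg"})),
    Doc("ru2", "ru", date(2016, 5, 3), frozenset({"c.jpg"})),
]
kk = [
    Doc("kk1", "kk", date(2016, 5, 1), frozenset({"a.jpg"})),
    Doc("kk2", "kk", date(2016, 5, 10), frozenset({"c.jpg"})),
]
pairs = align(ru, kk)
print(pairs)  # [('ru1', 'kk1')]
```

In this toy run `ru2` stays unpaired: it shares `c.jpg` with `kk2`, but the two stories were published seven days apart, which falls outside the assumed window. A real system would likely need to break ties and prevent one target from being paired with several sources.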
Original language | English |
---|---|
Title of host publication | Application of Information and Communication Technologies, AICT 2016 - Conference Proceedings |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
ISBN (Electronic) | 9781509018406 |
Publication status | Published - Jul 25 2017 |
Event | 10th IEEE International Conference on Application of Information and Communication Technologies, AICT 2016 - Baku, Azerbaijan (Oct 12 2016 → Oct 14 2016) |
Conference
Conference | 10th IEEE International Conference on Application of Information and Communication Technologies, AICT 2016 |
---|---|
Country | Azerbaijan |
City | Baku |
Period | 10/12/16 → 10/14/16 |
Keywords
- Document alignment
- machine translation
- parallel corpus
ASJC Scopus subject areas
- Computer Vision and Pattern Recognition
- Computer Science Applications
- Computer Networks and Communications
- Information Systems
- Modelling and Simulation