Manual vs automatic bitext extraction

Aibek Makazhanov, Bagdat Myrzakhmetov, Zhenisbek Assylbekov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out with the markup. When it comes to sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as fiddling with off-the-shelf solutions, pays a noticeable dividend. Overall, we observe that, depending on the source, automatic bitext extraction methods may be severely lacking in coverage (retrieve fewer sentence pairs) and on average are less precise (retrieve fewer parallel sentence pairs). We conclude that if one aims at extracting high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution.

Original language: English
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 3834-3838
Number of pages: 5
ISBN (Electronic): 9791095546009
Publication status: Published - 2019
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: May 7 2018 to May 12 2018

Publication series

Name: LREC 2018 - 11th International Conference on Language Resources and Evaluation

Other

Other: 11th International Conference on Language Resources and Evaluation, LREC 2018
Country/Territory: Japan
City: Miyazaki
Period: 5/7/18 to 5/12/18

Keywords

  • Bitext extraction
  • Crawling
  • Sentence alignment

ASJC Scopus subject areas

  • Linguistics and Language
  • Education
  • Library and Information Sciences
  • Language and Linguistics
