Manual vs automatic bitext extraction

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out with the markup. When it comes to sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as fiddling with off-the-shelf solutions, pays a noticeable dividend. Overall, we observe that, depending on the source, automatic bitext extraction methods may lack severely in coverage (retrieve fewer sentence pairs) and on average are less precise (retrieve fewer parallel sentence pairs). We conclude that if one aims at extracting high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution.

Original language: English
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 3834-3838
Number of pages: 5
ISBN (Electronic): 9791095546009
Publication status: Published - Jan 1 2019
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: May 7 2018 → May 12 2018

Other

Other: 11th International Conference on Language Resources and Evaluation, LREC 2018
Country: Japan
City: Miyazaki
Period: 5/7/18 → 5/12/18

Keywords

  • Bitext extraction
  • Crawling
  • Sentence alignment

ASJC Scopus subject areas

  • Linguistics and Language
  • Education
  • Library and Information Sciences
  • Language and Linguistics

Cite this

Makazhanov, A., Myrzakhmetov, B., & Assylbekov, Z. (2019). Manual vs automatic bitext extraction. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, ... T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 3834-3838). European Language Resources Association (ELRA).

Manual vs automatic bitext extraction. / Makazhanov, Aibek; Myrzakhmetov, Bagdat; Assylbekov, Zhenisbek.

LREC 2018 - 11th International Conference on Language Resources and Evaluation. ed. / Hitoshi Isahara; Bente Maegaard; Stelios Piperidis; Christopher Cieri; Thierry Declerck; Koiti Hasida; Helene Mazo; Khalid Choukri; Sara Goggi; Joseph Mariani; Asuncion Moreno; Nicoletta Calzolari; Jan Odijk; Takenobu Tokunaga. European Language Resources Association (ELRA), 2019. p. 3834-3838.

Makazhanov, A, Myrzakhmetov, B & Assylbekov, Z 2019, Manual vs automatic bitext extraction. in H Isahara, B Maegaard, S Piperidis, C Cieri, T Declerck, K Hasida, H Mazo, K Choukri, S Goggi, J Mariani, A Moreno, N Calzolari, J Odijk & T Tokunaga (eds), LREC 2018 - 11th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA), pp. 3834-3838, 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, 5/7/18.
Makazhanov A, Myrzakhmetov B, Assylbekov Z. Manual vs automatic bitext extraction. In Isahara H, Maegaard B, Piperidis S, Cieri C, Declerck T, Hasida K, Mazo H, Choukri K, Goggi S, Mariani J, Moreno A, Calzolari N, Odijk J, Tokunaga T, editors, LREC 2018 - 11th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA). 2019. p. 3834-3838
Makazhanov, Aibek ; Myrzakhmetov, Bagdat ; Assylbekov, Zhenisbek. / Manual vs automatic bitext extraction. LREC 2018 - 11th International Conference on Language Resources and Evaluation. editor / Hitoshi Isahara ; Bente Maegaard ; Stelios Piperidis ; Christopher Cieri ; Thierry Declerck ; Koiti Hasida ; Helene Mazo ; Khalid Choukri ; Sara Goggi ; Joseph Mariani ; Asuncion Moreno ; Nicoletta Calzolari ; Jan Odijk ; Takenobu Tokunaga. European Language Resources Association (ELRA), 2019. pp. 3834-3838
@inproceedings{d7adf96193b447988a9e4b8b5e7fe49c,
title = "Manual vs automatic bitext extraction",
abstract = "We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out with the markup. When it comes to sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as fiddling with off-the-shelf solutions, pays a noticeable dividend. Overall, we observe that, depending on the source, automatic bitext extraction methods may lack severely in coverage (retrieve fewer sentence pairs) and on average are less precise (retrieve fewer parallel sentence pairs). We conclude that if one aims at extracting high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution.",
keywords = "Bitext extraction, Crawling, Sentence alignment",
author = "Aibek Makazhanov and Bagdat Myrzakhmetov and Zhenisbek Assylbekov",
year = "2019",
month = "1",
day = "1",
language = "English",
pages = "3834--3838",
editor = "Hitoshi Isahara and Bente Maegaard and Stelios Piperidis and Christopher Cieri and Thierry Declerck and Koiti Hasida and Helene Mazo and Khalid Choukri and Sara Goggi and Joseph Mariani and Asuncion Moreno and Nicoletta Calzolari and Jan Odijk and Takenobu Tokunaga",
booktitle = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN
T1 - Manual vs automatic bitext extraction
AU - Makazhanov, Aibek
AU - Myrzakhmetov, Bagdat
AU - Assylbekov, Zhenisbek
PY - 2019/1/1
Y1 - 2019/1/1
N2 - We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out with the markup. When it comes to sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as fiddling with off-the-shelf solutions, pays a noticeable dividend. Overall, we observe that, depending on the source, automatic bitext extraction methods may lack severely in coverage (retrieve fewer sentence pairs) and on average are less precise (retrieve fewer parallel sentence pairs). We conclude that if one aims at extracting high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution.
AB - We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out with the markup. When it comes to sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as fiddling with off-the-shelf solutions, pays a noticeable dividend. Overall, we observe that, depending on the source, automatic bitext extraction methods may lack severely in coverage (retrieve fewer sentence pairs) and on average are less precise (retrieve fewer parallel sentence pairs). We conclude that if one aims at extracting high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution.
KW - Bitext extraction
KW - Crawling
KW - Sentence alignment
UR - http://www.scopus.com/inward/record.url?scp=85059893861&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85059893861&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85059893861
SP - 3834
EP - 3838
BT - LREC 2018 - 11th International Conference on Language Resources and Evaluation
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Piperidis, Stelios
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Hasida, Koiti
A2 - Mazo, Helene
A2 - Choukri, Khalid
A2 - Goggi, Sara
A2 - Mariani, Joseph
A2 - Moreno, Asuncion
A2 - Calzolari, Nicoletta
A2 - Odijk, Jan
A2 - Tokunaga, Takenobu
PB - European Language Resources Association (ELRA)
ER -