Manual vs Automatic Bitext Extraction

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We compare manual and automatic approaches to extracting bitexts from the Web through a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling yields cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out along with the markup. For sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as in tuning off-the-shelf solutions, pays a noticeable dividend. Overall, we observe that, depending on the source, automatic bitext extraction methods may be severely lacking in coverage (they retrieve fewer sentence pairs) and are on average less precise (they retrieve fewer parallel sentence pairs). We conclude that if one aims to extract high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution.
Original language: Undefined/Unknown
Title of host publication: 11th International Conference on Language Resources and Evaluation
Place of Publication: Miyazaki, Japan
Publication status: Published - 2018

Cite this

Makazhanov, A., Myrzakhmetov, B., & Assylbekov, Z. (2018). Manual vs Automatic Bitext Extraction. In 11th International Conference on Language Resources and Evaluation. Miyazaki, Japan.

Manual vs Automatic Bitext Extraction. / Makazhanov, Aibek; Myrzakhmetov, Bagdat; Assylbekov, Zhenisbek.

11th International Conference on Language Resources and Evaluation. Miyazaki, Japan, 2018.

Makazhanov, A, Myrzakhmetov, B & Assylbekov, Z 2018, Manual vs Automatic Bitext Extraction. in 11th International Conference on Language Resources and Evaluation. Miyazaki, Japan.
Makazhanov A, Myrzakhmetov B, Assylbekov Z. Manual vs Automatic Bitext Extraction. In 11th International Conference on Language Resources and Evaluation. Miyazaki, Japan. 2018.
Makazhanov, Aibek ; Myrzakhmetov, Bagdat ; Assylbekov, Zhenisbek. / Manual vs Automatic Bitext Extraction. 11th International Conference on Language Resources and Evaluation. Miyazaki, Japan, 2018.
@inproceedings{ebd43cbdd7f64f86a7d741268c952513,
title = "Manual vs Automatic Bitext Extraction",
abstract = "We compare manual and automatic approaches to extracting bitexts from the Web through a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling yields cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out along with the markup. For sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as in tuning off-the-shelf solutions, pays a noticeable dividend. Overall, we observe that, depending on the source, automatic bitext extraction methods may be severely lacking in coverage (they retrieve fewer sentence pairs) and are on average less precise (they retrieve fewer parallel sentence pairs). We conclude that if one aims to extract high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution.",
author = "Aibek Makazhanov and Bagdat Myrzakhmetov and Zhenisbek Assylbekov",
year = "2018",
language = "Undefined/Unknown",
booktitle = "11th International Conference on Language Resources and Evaluation",

}

TY - GEN

T1 - Manual vs Automatic Bitext Extraction

AU - Makazhanov, Aibek

AU - Myrzakhmetov, Bagdat

AU - Assylbekov, Zhenisbek

PY - 2018

Y1 - 2018

AB - We compare manual and automatic approaches to extracting bitexts from the Web through a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling yields cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out along with the markup. For sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as in tuning off-the-shelf solutions, pays a noticeable dividend. Overall, we observe that, depending on the source, automatic bitext extraction methods may be severely lacking in coverage (they retrieve fewer sentence pairs) and are on average less precise (they retrieve fewer parallel sentence pairs). We conclude that if one aims to extract high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution.

M3 - Conference contribution

BT - 11th International Conference on Language Resources and Evaluation

CY - Miyazaki, Japan

ER -