TY - GEN
T1 - Manual vs automatic bitext extraction
AU - Makazhanov, Aibek
AU - Myrzakhmetov, Bagdat
AU - Assylbekov, Zhenisbek
N1 - Funding Information:
This work has been funded by the Nazarbayev University under the research grant ?129-2017/022-2017 and by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under the research grant AP05134272.
Funding Information:
This work has been funded by the Nazarbayev University under the research grant №129-2017/022-2017 and by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under the research grant AP05134272.
Publisher Copyright:
© LREC 2018 - 11th International Conference on Language Resources and Evaluation. All rights reserved.
PY - 2019
Y1 - 2019
N2 - We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out with the markup. When it comes to sentence splitting and alignment we show that investing some effort in data pre- and post-processing as well as fiddling with off-the-shelf solutions pays a noticeable dividend. Overall we observe that, depending on the source, automatic bitext extraction methods may lack severely in coverage (retrieve fewer sentence pairs) and on average are fewer precise (retrieve less parallel sentence pairs). We conclude that if one aims at extracting high-quality bitexts for a small number of language pairs, automatic methods best be avoided, or at least used with caution.
AB - We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out with the markup. When it comes to sentence splitting and alignment we show that investing some effort in data pre- and post-processing as well as fiddling with off-the-shelf solutions pays a noticeable dividend. Overall we observe that, depending on the source, automatic bitext extraction methods may lack severely in coverage (retrieve fewer sentence pairs) and on average are fewer precise (retrieve less parallel sentence pairs). We conclude that if one aims at extracting high-quality bitexts for a small number of language pairs, automatic methods best be avoided, or at least used with caution.
KW - Bitext extraction
KW - Crawling
KW - Sentence alignment
UR - http://www.scopus.com/inward/record.url?scp=85059893861&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85059893861&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85059893861
T3 - LREC 2018 - 11th International Conference on Language Resources and Evaluation
SP - 3834
EP - 3838
BT - LREC 2018 - 11th International Conference on Language Resources and Evaluation
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Piperidis, Stelios
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Hasida, Koiti
A2 - Mazo, Helene
A2 - Choukri, Khalid
A2 - Goggi, Sara
A2 - Mariani, Joseph
A2 - Moreno, Asuncion
A2 - Calzolari, Nicoletta
A2 - Odijk, Jan
A2 - Tokunaga, Takenobu
PB - European Language Resources Association (ELRA)
T2 - 11th International Conference on Language Resources and Evaluation, LREC 2018
Y2 - 7 May 2018 through 12 May 2018
ER -