Parallel sentence generation from comparable corpora for improved SMT

Abdul Rauf, Sadaf; Schwenk, Holger

doi:10.1007/s10590-011-9114-9

Published: 09 October 2011

Volume 25, pages 341–375, (2011)
Machine Translation

Abstract

A parallel corpus is an essential resource for statistical machine translation (SMT), but it is often not available in the required amounts for all domains and languages. We present an approach that produces parallel corpora from available comparable corpora. An SMT system translates the source-language part of a comparable corpus, and the translations are used as queries for information retrieval (IR) over the target-language side of the comparable corpus. Simple filters then score each pair of SMT output and IR-returned sentence, with the filter score defining the degree of similarity between the two. Using SMT output also makes it possible to correct one of the common errors through sentence tail removal. The approach was applied to Arabic–English and French–English systems using comparable news corpora, and considerable improvements in BLEU score were achieved. We show that the approach is independent of the quality of the SMT system used to generate the queries, which supports its applicability to languages and domains with little parallel data to start with. We also compare our approach with an earlier one and show that ours is easier to implement and gives equally good improvements.
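The translate, retrieve, and filter loop described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the SMT system, the IR engine, or the filters, so the dictionary lookup standing in for SMT, the token-overlap retrieval, the word-error-rate filter, the threshold of 0.5, and all function names are hypothetical choices made for this example.

```python
from collections import Counter


def word_edit_distance(hyp, ref):
    """Levenshtein distance computed over word tokens."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a word from hyp
                            curr[j - 1] + 1,      # insert a word from ref
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def filter_score(translation, candidate):
    """Assumed filter: word error rate of the SMT output against the
    retrieved sentence. Lower scores mean the two are more similar."""
    hyp, ref = translation.split(), candidate.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)


def retrieve(query, target_sentences):
    """Toy stand-in for IR: the target sentence sharing the most word
    tokens with the query (first one wins on ties)."""
    q = Counter(query.split())

    def overlap(sentence):
        return sum((q & Counter(sentence.split())).values())

    return max(target_sentences, key=overlap)


def mine_pairs(source_sentences, translate, target_sentences, threshold):
    """Translate each source sentence, retrieve the closest target
    sentence, and keep the pair if the filter score is low enough."""
    pairs = []
    for src in source_sentences:
        query = translate(src)
        candidate = retrieve(query, target_sentences)
        if filter_score(query, candidate) <= threshold:
            pairs.append((src, candidate))
    return pairs


# Hypothetical toy data: a lookup table plays the role of the SMT system.
toy_smt = {
    "le chat dort sur le divan": "the cat sleeps on the sofa",
    "il pleut sur paris": "it rains on paris",
}
target_side = [
    "the cat sleeps on the couch",
    "stocks fell sharply on monday",
    "the weather in london is mild",
]

pairs = mine_pairs(list(toy_smt), toy_smt.get, target_side, threshold=0.5)
for src, tgt in pairs:
    print(src, "|||", tgt)
```

In this toy run the first source sentence is paired with its near-translation on the target side, while the second has no good match and is rejected by the filter. Note that the extracted pair keeps the original target-side sentence, not the SMT output, which serves only as a query and a point of comparison.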

Author information

Authors and Affiliations

LIUM, University of Le Mans, Le Mans Cedex 9, France
Sadaf Abdul Rauf & Holger Schwenk

Corresponding author

Correspondence to Sadaf Abdul Rauf.

About this article

Cite this article

Abdul Rauf, S., Schwenk, H. Parallel sentence generation from comparable corpora for improved SMT. Machine Translation 25, 341–375 (2011). https://doi.org/10.1007/s10590-011-9114-9

Received: 15 September 2010
Accepted: 13 September 2011
Published: 09 October 2011
Issue Date: December 2011
DOI: https://doi.org/10.1007/s10590-011-9114-9
