Parallel sentence generation from comparable corpora for improved SMT

Machine Translation

Abstract

Parallel corpora are an essential resource for statistical machine translation (SMT), but they are often unavailable in the required amounts for every domain and language. We present an approach that produces parallel corpora from available comparable corpora. An SMT system translates the source-language side of a comparable corpus, and the translations are used as queries for information retrieval (IR) over the target-language side. Simple filters then score each SMT output against the sentence returned by IR, with the filter score defining the degree of similarity between the two. Working from SMT output has the added benefit of making it possible to correct one common error by removing sentence tails. We applied the approach to Arabic–English and French–English systems using comparable news corpora and obtained considerable improvements in BLEU score. We show that the approach is independent of the quality of the SMT system used to generate the queries, which strengthens the claim that it is applicable to languages and domains with only limited parallel corpora to start from. We also compare it with an earlier approach and show that ours is easier to implement and gives equally good improvements.
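
The translate, retrieve, and filter loop described in the abstract can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the abstract does not name the filters or the retrieval model, so a length-normalised word-level edit distance stands in for the filter and a bag-of-words overlap stands in for the IR engine. The `translate` argument is a hypothetical placeholder for the SMT system, and the `threshold` value is arbitrary.

```python
from collections import Counter


def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a word from hyp
                            curr[j - 1] + 1,      # add a word from ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def filter_score(translation, candidate):
    """Assumed filter: edit distance divided by candidate length (lower is closer)."""
    hyp, ref = translation.lower().split(), candidate.lower().split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)


def retrieve(query, target_sentences):
    """Toy IR step: pick the target sentence sharing the most words with the query."""
    q = Counter(query.lower().split())

    def overlap(sentence):
        return sum((q & Counter(sentence.lower().split())).values())

    return max(target_sentences, key=overlap)


def mine_pairs(source_sentences, translate, target_sentences, threshold=0.5):
    """Translate each source sentence, retrieve a candidate, keep pairs that pass the filter."""
    pairs = []
    for src in source_sentences:
        query = translate(src)                      # hypothetical SMT call
        candidate = retrieve(query, target_sentences)
        score = filter_score(query, candidate)
        if score <= threshold:                      # arbitrary cut-off
            pairs.append((src, candidate, score))
    return pairs
```

In this sketch the kept pair joins the original source sentence with the retrieved target sentence; the machine translation serves only as the query and as the reference for the filter score.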



Author information

Corresponding author

Correspondence to Sadaf Abdul Rauf.


About this article

Cite this article

Abdul Rauf, S., Schwenk, H. Parallel sentence generation from comparable corpora for improved SMT. Machine Translation 25, 341–375 (2011). https://doi.org/10.1007/s10590-011-9114-9


  • DOI: https://doi.org/10.1007/s10590-011-9114-9
