Abstract
Parallel corpora are an essential resource for statistical machine translation (SMT), but they are often not available in the required amounts for all domains and languages. We present an approach that produces parallel corpora from available comparable corpora. An SMT system translates the source-language part of a comparable corpus, and the translations are used as queries for information retrieval (IR) over the target-language side of the comparable corpus. Simple filters then score each SMT output against the sentence returned by IR, with the filter score defining the degree of similarity between the two. Working from SMT output has the added benefit that one of the common errors can be corrected by sentence tail removal. We applied the approach to Arabic–English and French–English systems using comparable news corpora and achieved considerable improvements in BLEU score. We show that the approach is independent of the quality of the SMT system used to produce the queries, which strengthens the claim that it is applicable to languages and domains with only limited parallel corpora available to start with. We also compare our approach with one of the earlier approaches and show that ours is easier to implement and gives equally good improvements.
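The translate, retrieve, and filter steps described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the IR engine or the filters, so a bag-of-words overlap ranking stands in for the IR step, word error rate stands in for a "simple filter", and the names `retrieve`, `word_error_rate`, `extract_pairs`, `max_wer`, and the toy dictionary `toy_smt` are hypothetical. Sentence tail removal is not covered.

```python
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return sentence.lower().split()


def retrieve(query, candidates, top_n=3):
    """Stand-in IR step: rank target-side sentences by word overlap with the query."""
    query_counts = Counter(tokenize(query))
    scored = []
    for idx, cand in enumerate(candidates):
        overlap = sum((query_counts & Counter(tokenize(cand))).values())
        scored.append((overlap, idx))
    # Highest overlap first; ties keep the original corpus order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidates[idx] for overlap, idx in scored[:top_n] if overlap > 0]


def word_error_rate(hypothesis, reference):
    """Stand-in filter: word-level edit distance divided by reference length."""
    hyp, ref = tokenize(hypothesis), tokenize(reference)
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def extract_pairs(source_sentences, translate, target_sentences, max_wer=0.5):
    """Translate each source sentence, retrieve candidates, keep the best if it passes the filter."""
    pairs = []
    for src in source_sentences:
        query = translate(src)  # SMT output used as the IR query
        best = None
        for cand in retrieve(query, target_sentences):
            score = word_error_rate(query, cand)
            if best is None or score < best[1]:
                best = (cand, score)
        if best is not None and best[1] <= max_wer:
            pairs.append((src, best[0], best[1]))
    return pairs


# Toy data: a lookup table plays the role of the SMT system (hypothetical).
toy_smt = {
    "le président a visité paris lundi": "the president visited paris on monday",
    "il pleut beaucoup": "it rains a lot",
}
target_side = [
    "the president visited paris monday",
    "stocks fell sharply in london",
]
pairs = extract_pairs(list(toy_smt), toy_smt.get, target_side)
print(pairs)
```

In this toy run only the first source sentence yields a pair, because the second one retrieves no overlapping candidate. The threshold `max_wer` controls the trade-off between the amount and the quality of extracted sentence pairs; its value here is arbitrary.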
Cite this article
Abdul Rauf, S., Schwenk, H. Parallel sentence generation from comparable corpora for improved SMT. Machine Translation 25, 341–375 (2011). https://doi.org/10.1007/s10590-011-9114-9
DOI: https://doi.org/10.1007/s10590-011-9114-9