Metrics for MT evaluation: evaluating reordering

Birch, Alexandra; Osborne, Miles; Blunsom, Phil

doi:10.1007/s10590-009-9066-5

Published: 07 January 2010

Volume 24, pages 15–26, (2010)

Machine Translation

Alexandra Birch¹,
Miles Osborne¹ &
Phil Blunsom²

Abstract

Translating between dissimilar languages requires an account of the use of divergent word orders when expressing the same semantic content. Reordering poses a serious problem for statistical machine translation systems and has generated a considerable body of research aimed at meeting its challenges. Direct evaluation of reordering requires automatic metrics that explicitly measure the quality of word order choices in translations. Current metrics, such as BLEU, only evaluate reordering indirectly. We analyse the ability of current metrics to capture reordering performance. We then introduce permutation distance metrics as a direct method for measuring word order similarity between translations and reference sentences. By correlating all metrics with a novel method for eliciting human judgements of reordering quality, we show that current metrics are largely influenced by lexical choice, and that they are not able to distinguish between different reordering scenarios. Also, we show that permutation distance metrics correlate very well with human judgements, and are impervious to lexical differences.
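
As an illustration of what a permutation distance can look like, the sketch below computes a normalised Kendall's tau distance, one common member of this family. The abstract does not say which distances the article uses or how it normalises them, so the function names (`kendall_tau_distance`, `kendall_tau_score`) and the input convention are assumptions for illustration only: a permutation is taken to be a list in which position i holds the reference-order index of the i-th translated word.

```python
from itertools import combinations


def kendall_tau_distance(perm):
    """Fraction of item pairs that appear in inverted relative order.

    `perm` is a sequence of distinct integers (hypothetical convention:
    perm[i] is the reference position of the i-th word in the translation).
    Returns 0.0 for a monotone order and 1.0 for a fully reversed order.
    """
    n = len(perm)
    if n < 2:
        return 0.0  # nothing to reorder
    # Count pairs (i, j), i < j, whose order disagrees with the reference.
    discordant = sum(
        1 for i, j in combinations(range(n), 2) if perm[i] > perm[j]
    )
    total_pairs = n * (n - 1) / 2
    return discordant / total_pairs


def kendall_tau_score(perm):
    """Similarity form of the distance: 1.0 means identical word order."""
    return 1.0 - kendall_tau_distance(perm)


if __name__ == "__main__":
    print(kendall_tau_score([0, 1, 2, 3]))  # identical order
    print(kendall_tau_score([1, 0, 2, 3]))  # one adjacent swap
    print(kendall_tau_score([3, 2, 1, 0]))  # full reversal
```

Because the score depends only on the order of aligned positions and never on the words themselves, a measure of this kind is unaffected by lexical choice, which is the property the abstract attributes to permutation distance metrics.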

Author information

Authors and Affiliations

University of Edinburgh, 10 Crichton Street, EH8 9AB, Edinburgh, UK
Alexandra Birch & Miles Osborne
University of Oxford, Oxford, UK
Phil Blunsom

Corresponding author

Correspondence to Alexandra Birch.

About this article

Cite this article

Birch, A., Osborne, M. & Blunsom, P. Metrics for MT evaluation: evaluating reordering. Machine Translation 24, 15–26 (2010). https://doi.org/10.1007/s10590-009-9066-5

Received: 10 May 2009
Accepted: 28 November 2009
Published: 07 January 2010
Issue Date: March 2010
DOI: https://doi.org/10.1007/s10590-009-9066-5
