Abstract
This paper presents a wide range of statistical word alignment experiments that incorporate morphosyntactic information. By transforming the parallel corpus according to POS tags, lemmas, or stems, we explore which kinds of linguistic information help reduce alignment error rates. To this end, alignments are evaluated against a human word alignment reference, with the aim of an improved machine translation training scheme that eventually leads to better SMT performance. Experiments are carried out on a Spanish–English European Parliament Proceedings parallel corpus, in both a large and a small data track. As expected, the improvements from introducing morphosyntactic information are larger when data are scarce, but a significant improvement is also achieved in the large data task, meaning that certain linguistic knowledge remains relevant even when large amounts of data are available.
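As an illustration of the two ideas named in the abstract, the following minimal Python sketch shows a position-preserving token transformation and an alignment error rate (AER) computation against a human reference. It is not the authors' implementation. It assumes the commonly used AER definition from Och and Ney (2003) with sure and possible reference links; the function names, the toy lemma table, and the example sentence pair are hypothetical.

```python
def apply_token_map(sentence, token_map):
    """Replace each token by its mapped form (for example a lemma or stem).

    Token positions are kept unchanged, so alignment links computed on the
    transformed sentence can be read back onto the original words.
    Tokens missing from the map are left as they are.
    """
    return [token_map.get(token, token) for token in sentence]


def alignment_error_rate(hypothesis, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    A is the set of hypothesis links, S the sure reference links and
    P the possible reference links. Each link is a (source_index,
    target_index) pair. Sure links are also counted as possible.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s
    if not a and not s:
        return 0.0  # nothing to align and nothing aligned
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Hypothetical toy data: a lemma table and one sentence pair.
lemmas = {"casas": "casa", "houses": "house"}
source = apply_token_map(["las", "casas", "verdes"], lemmas)
target = apply_token_map(["the", "green", "houses"], lemmas)

# Reference links as (source index, target index).
sure = {(1, 2), (2, 1)}   # casas-houses, verdes-green
possible = {(0, 0)}       # las-the
hypothesis = {(1, 2), (2, 1), (0, 1)}  # last link is an error

print(source, target)
print(round(alignment_error_rate(hypothesis, sure, possible), 3))
```

Keeping token positions fixed is a simplifying assumption of this sketch: it lets the same reference links score alignments trained on either the original or the transformed corpus.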
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
de Gispert, A. et al. (2006). Improving Statistical Word Alignments with Morpho-syntactic Transformations. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_38
DOI: https://doi.org/10.1007/11816508_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science (R0)