Domain adaptation via pseudo in-domain data selection

A Axelrod, X He, J Gao - Proceedings of the 2011 conference on empirical methods in natural …, 2011 - aclanthology.org

Abstract

We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.
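
The abstract does not spell out the three scoring functions, so the following is only a minimal sketch of one common cross-entropy based criterion: rank each general-domain sentence by its cross-entropy under an in-domain language model minus its cross-entropy under a general-domain language model, then keep the lowest-scoring fraction. The add-one-smoothed unigram models, the function names, the single-language scoring, and the `keep_fraction` parameter are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace-separated tokens; returns (counts, total_tokens)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one-smoothed unigram model.

    Assumes the sentence has at least one token.
    """
    counts, total = model
    tokens = sentence.split()
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select_pseudo_in_domain(in_domain, general, keep_fraction=0.01):
    """Return the general-domain sentences that look most in-domain.

    Illustrative assumption: score = H_in(s) - H_general(s), lower is better.
    """
    in_model = train_unigram(in_domain)
    gen_model = train_unigram(general)
    # Shared vocabulary size, plus one slot for unseen tokens.
    vocab_size = len(set(in_model[0]) | set(gen_model[0])) + 1

    candidates = [s for s in general if s.split()]  # skip empty lines
    if not candidates:
        return []

    def score(s):
        return (cross_entropy(s, in_model, vocab_size)
                - cross_entropy(s, gen_model, vocab_size))

    ranked = sorted(candidates, key=score)
    keep = max(1, int(len(candidates) * keep_fraction))
    return ranked[:keep]
```

Calling `select_pseudo_in_domain(in_domain_sentences, general_sentences, keep_fraction=0.01)` would return roughly the top 1% of the general corpus, mirroring the subcorpus size mentioned in the abstract. A real system would use stronger n-gram language models and would carry the selected sentences' translations along with them.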

Cited by 658
