Domain adaptation via pseudo in-domain data selection

A Axelrod, X He, J Gao - Proceedings of the 2011 conference on …, 2011 - aclanthology.org
Abstract
We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that the best results are attained via proper domain-relevant data selection, as well as by combining in-domain and general-domain systems during decoding.
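To make the idea of cross-entropy based selection concrete, here is a minimal sketch of one such criterion: ranking general-domain sentences by the difference between their per-word cross-entropy under an in-domain language model and under a general-domain language model, then keeping the lowest-scoring fraction. The abstract does not spell out the three methods, so this should be read as an illustration under stated assumptions, not as the authors' implementation. The add-one smoothed unigram models are a simplification chosen for brevity (an n-gram model would be the more usual choice), and the names `train_unigram`, `cross_entropy`, `select_pseudo_in_domain`, and the `fraction` parameter are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (counts, total_tokens) for whitespace-tokenized sentences."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(tokens, model, vocab_size):
    """Per-word cross-entropy in bits, using add-one smoothing (an assumption)."""
    counts, total = model
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select_pseudo_in_domain(in_domain, general, fraction=0.01):
    """Keep the `fraction` of general-domain sentences that look most in-domain.

    Score = H_in(s) - H_general(s); lower means closer to the in-domain data
    and further from the general-domain average.
    """
    in_lm = train_unigram(in_domain)
    gen_lm = train_unigram(general)
    # Shared vocabulary size, plus one slot for unseen words.
    vocab_size = len(set(in_lm[0]) | set(gen_lm[0])) + 1

    def score(sentence):
        tokens = sentence.split()
        if not tokens:
            return float("inf")  # empty lines are ranked last
        return cross_entropy(tokens, in_lm, vocab_size) - cross_entropy(
            tokens, gen_lm, vocab_size
        )

    ranked = sorted(general, key=score)
    keep = max(1, int(len(general) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    in_domain = ["the patient received a dose", "the dose was reduced"]
    general = [
        "the patient was given a dose",
        "stocks fell sharply today",
        "the market closed lower",
        "football match ended late",
    ]
    print(select_pseudo_in_domain(in_domain, general, fraction=0.25))
```

For a parallel corpus, the same score could be computed on each side of a sentence pair and the two scores summed, so that both the source and target sentences must look in-domain; that extension is left out here to keep the sketch short.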