Domain adaptation via pseudo in-domain data selection
Proceedings of the 2011 conference on empirical methods in natural …, 2011•aclanthology.org
Abstract
We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy-based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.
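The abstract does not spell out its three selection methods, so the following is only a minimal sketch of one common cross-entropy-based criterion: rank each general-domain sentence by the difference between its cross-entropy under an in-domain language model and under a general-domain language model, then keep the lowest-scoring fraction. The unigram models, add-one smoothing, whitespace tokenization, and the names `train_unigram`, `cross_entropy`, and `select_pseudo_in_domain` are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total token count) for a list of sentences."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one-smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return -log_prob / len(tokens)


def select_pseudo_in_domain(in_domain, general, fraction=0.01):
    """Keep the `fraction` of general-domain sentences that look most in-domain.

    Assumption: sentences are scored by cross-entropy difference,
    H_in(s) - H_general(s); lower scores mean "more like the in-domain data".
    """
    in_model = train_unigram(in_domain)
    gen_model = train_unigram(general)
    # Shared vocabulary size, plus one slot for unseen tokens.
    vocab_size = len(set(in_model[0]) | set(gen_model[0])) + 1

    def score(s):
        return cross_entropy(s, in_model, vocab_size) - cross_entropy(
            s, gen_model, vocab_size
        )

    candidates = [s for s in general if s.split()]  # skip empty sentences
    ranked = sorted(candidates, key=score)
    keep = max(1, int(len(candidates) * fraction))
    return ranked[:keep]
```

In practice the unigram models would be replaced by stronger n-gram or neural language models, and for a parallel corpus the score could be computed on either or both language sides; the default `fraction=0.01` mirrors the 1% subcorpus size mentioned in the abstract.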