Conference Proceedings


Julie Carson-Berndsen
Andy Way
Jie Jiang



machine translating statistical machine translation phrase based statistical machine translation score cleaning bilingual corpus lattice based parallel corpora

Lattice score based data cleaning for phrase-based statistical machine translation (2010)

Abstract Statistical machine translation relies heavily on parallel corpora to train its models for translation tasks. While more and more bilingual corpora are readily available, the quality of the sentence pairs should be taken into consideration. This paper presents a novel lattice score-based data cleaning method to select proper sentence pairs from the ones extracted from a bilingual corpus by the sentence alignment methods. The proposed method is carried out as follows: firstly, an initial phrasebased model is trained on the full sentencealigned corpus; then for each of the sentence pairs in the corpus, word alignments are used to create anchor pairs and sourceside lattices; thirdly, based on the translation model, target-side phrase networks are expanded on the lattices and Viterbi searching is used to find approximated decoding results; finally, BLEU score thresholds are used to filter out the low-score sentence pairs for the data cleaning purpose. Our experiments on the FBIS corpus showed improvements of BLEU score from 23.78 to 24.02 in Chinese-English.
Collections Ireland -> Dublin City University -> Publication Type = Conference or Workshop Item
Ireland -> Dublin City University -> Subject = Computer Science
Ireland -> Dublin City University -> DCU Faculties and Centres = Research Initiatives and Centres: Centre for Next Generation Localisation (CNGL)
Ireland -> Dublin City University -> Status = Published
Ireland -> Dublin City University -> Subject = Computer Science: Machine translating
Ireland -> Dublin City University -> DCU Faculties and Centres = Research Initiatives and Centres

Full list of authors on original publication

Julie Carson-Berndsen, Andy Way, Jie Jiang

Experts in our system

Andy Way
Dublin City University
Total Publications: 229