Conference Proceedings


Andy Way
Jinhua Du




Pinyin as subword unit for Chinese-sourced neural machine translation (2017)

Abstract: The unknown word (UNK), or open vocabulary, problem is a challenge for neural machine translation (NMT). For alphabetic languages such as English, German and French, splitting words into subword units, for example with the byte pair encoding (BPE) algorithm, is an effective way to alleviate the UNK problem. However, for stroke-based languages such as Chinese, this method is not effective enough to ensure translation quality. In this paper, we propose to use Pinyin, a romanization system for Chinese characters, to convert Chinese characters into subword units and thereby alleviate the UNK problem. We first investigate how Pinyin and its four diacritics denoting tones affect the translation performance of NMT systems, and then propose different strategies for using Pinyin and tones as input factors for Chinese–English NMT. Extensive experiments on Chinese–English translation demonstrate that the proposed methods can remarkably improve translation quality and can effectively alleviate the UNK problem for Chinese-sourced translation.
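To make the idea of Pinyin with tones as input factors concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tiny hand-written lexicon (`TOY_LEXICON`, hypothetical and ignoring polyphonic characters), a pipe-delimited `syllable|tone` token format, and tone `0` for syllables without a tone mark; none of these choices come from the record itself.

```python
import unicodedata

# Combining diacritics that mark the four Pinyin tones (macron, acute, caron, grave).
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}

# Hypothetical toy lexicon for illustration only; a real system would need a
# full character-to-Pinyin dictionary and a way to handle polyphonic characters.
TOY_LEXICON = {
    "你": "nǐ", "好": "hǎo", "中": "zhōng",
    "国": "guó", "绿": "lǜ", "吗": "ma",
}


def split_tone(syllable):
    """Return (toneless syllable, tone number); 0 means no tone mark was found."""
    tone = 0
    kept = []
    # NFD separates base letters from combining marks, so tone marks can be
    # removed while other marks (the diaeresis in "ü") are kept.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept)), tone


def to_factored_pinyin(text, unk="<unk>"):
    """Map each character to a 'syllable|tone' token (format assumed for this sketch)."""
    tokens = []
    for char in text:
        syllable = TOY_LEXICON.get(char)
        if syllable is None:
            tokens.append(unk)  # character not covered by the toy lexicon
        else:
            base, tone = split_tone(syllable)
            tokens.append(f"{base}|{tone}")
    return tokens
```

For instance, `to_factored_pinyin("你好吗")` yields the tokens `ni|3`, `hao|3` and `ma|0`. Keeping the tone as a separate factor lets characters that share a syllable share the toneless unit while the tone still distinguishes them.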
Collections
Ireland -> Dublin City University -> Publication Type = Conference or Workshop Item
Ireland -> Dublin City University -> DCU Faculties and Centres = DCU Faculties and Schools: Faculty of Engineering and Computing: School of Computing
Ireland -> Dublin City University -> DCU Faculties and Centres = Research Initiatives and Centres: ADAPT
Ireland -> Dublin City University -> Status = Published
Ireland -> Dublin City University -> Subject = Computer Science: Machine translating

Full list of authors on original publication

Andy Way, Jinhua Du
