Training Deployable General Domain MT for a Low Resource Language Pair: English–Bangla
Please use this identifier to cite or link to this item:
http://hdl.handle.net/10045/76023
Title: | Training Deployable General Domain MT for a Low Resource Language Pair: English–Bangla |
---|---|
Author(s): | Dandapat, Sandipan | Lewis, William |
Keywords: | Machine Translation |
Knowledge area(s): | Computer Languages and Systems |
Publication date: | 2018 |
Publisher: | European Association for Machine Translation |
Bibliographic citation: | Dandapat, Sandipan; Lewis, William. “Training Deployable General Domain MT for a Low Resource Language Pair: English–Bangla”. In: Pérez-Ortiz, Juan Antonio, et al. (Eds.). Proceedings of the 21st Annual Conference of the European Association for Machine Translation: 28-30 May 2018, Universitat d'Alacant, Alacant, Spain, pp. 109-117 |
Abstract: | A large percentage of the world’s population speaks a language of the Indian subcontinent, what we will call here Indic languages, comprising languages from both Indo-European (e.g., Hindi, Bangla, Gujarati, etc.) and Dravidian (e.g., Tamil, Telugu, Malayalam, etc.) families, upwards of 1.5 Billion people. A universal characteristic of Indic languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general domain English–Bangla MT systems which are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores, however, using NMT systems, we have gained significant improvement over SMT baselines. This is achieved using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (throughput enhanced by an IME (Input Method Editor) designed specifically for QWERTY keyboard entry for Devanagari scripted languages), back-translation, different regularization techniques, dataset augmentation and early stopping. |
URI: | http://hdl.handle.net/10045/76023 |
ISBN: | 978-84-09-01901-4 |
Language: | eng |
Type: | info:eu-repo/semantics/conferenceObject |
Rights: | © 2018 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND. |
Peer reviewed: | yes |
Publisher's version: | http://eamt2018.dlsi.ua.es/proceedings-eamt2018.pdf |
Appears in collections: | EAMT2018 - Proceedings |
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
EAMT2018-Proceedings_13.pdf | | 1.63 MB | Adobe PDF | |
This item is licensed under a Creative Commons License