Training Deployable General Domain MT for a Low Resource Language Pair: English–Bangla

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/76023
Item information
Title: Training Deployable General Domain MT for a Low Resource Language Pair: English–Bangla
Author(s): Dandapat, Sandipan | Lewis, William
Keywords: Machine Translation
Subject area(s): Computer Languages and Systems
Publication date: 2018
Publisher: European Association for Machine Translation
Bibliographic citation: Dandapat, Sandipan; Lewis, William. “Training Deployable General Domain MT for a Low Resource Language Pair: English–Bangla”. In: Pérez-Ortiz, Juan Antonio, et al. (Eds.). Proceedings of the 21st Annual Conference of the European Association for Machine Translation: 28-30 May 2018, Universitat d'Alacant, Alacant, Spain, pp. 109-117
Abstract: A large percentage of the world’s population, upwards of 1.5 billion people, speaks a language of the Indian subcontinent. We call these Indic languages here; they comprise languages from both the Indo-European (e.g., Hindi, Bangla, Gujarati) and Dravidian (e.g., Tamil, Telugu, Malayalam) families. A universal characteristic of Indic languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general-domain English–Bangla MT systems that are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores; however, using NMT systems, we have gained significant improvement over the SMT baselines. This is achieved using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (with throughput enhanced by an Input Method Editor (IME) designed specifically for QWERTY keyboard entry for Devanagari-scripted languages), back-translation, different regularization techniques, dataset augmentation, and early stopping.
URI: http://hdl.handle.net/10045/76023
ISBN: 978-84-09-01901-4
Language: eng
Type: info:eu-repo/semantics/conferenceObject
Rights: © 2018 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.
Peer reviewed: yes
Publisher version: http://eamt2018.dlsi.ua.es/proceedings-eamt2018.pdf
Appears in collections: EAMT2018 - Proceedings

Files in this item:
File: EAMT2018-Proceedings_13.pdf | Size: 1.63 MB | Format: Adobe PDF


This item is licensed under a Creative Commons License.