Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4184
Title: Improving Sinhala – Tamil Translation through Deep Learning Techniques
Authors: Arukgoda, A.S.
Issue Date: 22-Jul-2021
Abstract: Neural Machine Translation (NMT) is currently the most promising approach to machine translation. Many languages have achieved state-of-the-art translation accuracy with NMT. However, because NMT is data-hungry, many low-resourced language pairs struggle to apply it and to generate intelligible translations. The lack of a large parallel corpus becomes an even more significant barrier when the language pair is morphologically rich and the corpus is multi-domain. This is because morphologically rich languages inherently have a large vocabulary, and inducing a model for such a large vocabulary requires many more parallel example sentences to learn from. In this research, we investigated translation in both directions between Sinhala and Tamil, a language pair that is both morphologically rich and low-resourced. To address the morphological richness, we analyzed different sub-word segmentation techniques, as proposed by previous work. We conducted a detailed analysis of these techniques on Sinhala and Tamil, which also helped us learn some linguistic properties of the two languages. Furthermore, to address the scarcity of data, we employed back-translation, one of the most widely used techniques, and analyzed its applicability to Sinhala and Tamil translation in both directions. In this process we designed a new language-independent technique that performs well when monolingual sentences are limited and that, given two languages, lets translation in one direction support translation in the other. Through the course of our experiments we gained an improvement of approximately 11 BLEU points for Tamil to Sinhala translation and an improvement of 7 BLEU points for Sinhala to Tamil translation over our baseline systems.
Sinhala and Tamil form a challenging language pair that has not previously been explored with NMT on an open-domain data set. The above improvement is statistically significant and contributes towards automatic translation between these two languages.
URI: http://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4184
Appears in Collections: 2018

Files in This Item:
File: 2014CS007.pdf
Size: 1.65 MB
Format: Adobe PDF

