Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4172
Title: EXPLORING NEURAL MACHINE TRANSLATION FOR SINHALA-TAMIL LANGUAGE PAIR
Authors: NISSANKA, L.N.A.S.H.
Keywords: Translator
Monolingual corpora data
Parallel corpus data
Neural machine translation
Issue Date: 19-Jul-2021
Abstract: In the face of rapid globalization, translation plays a central role in preserving native languages. Most research on Neural Machine Translation (NMT) has achieved impressive results using parallel corpora. Low-resource languages suffer from poor performance because parallel corpus data is scarce. Creating a parallel corpus for a language pair is expensive and requires people with expert knowledge of both languages. Sinhala and Tamil are considered low-resource languages due to the lack of linguistic resources. Conversely, monolingual corpora are much easier to find than parallel corpora, and many languages with limited parallel data may achieve strong results with monolingual corpora. The main aim of this research is to develop a translator for the Sinhala-Tamil language pair using monolingual corpora. To the best of our knowledge, no prior research has used only monolingual corpora to develop translation between Sinhala and Tamil. In the first part of the research, we examined Sinhala-Tamil translation performance using both parallel and monolingual corpora. We segmented the dataset into sub-word units. Sinhala and Tamil are considered morphologically rich languages, so they require appropriate treatment to reduce their morphological complexity. To address the scarcity of data, we applied back-translation techniques and analyzed their applicability to Sinhala-Tamil translation. In our experiments, we obtained a BLEU score of 30 for Tamil-to-Sinhala translation. In the second part, we examined a purely unsupervised approach using only monolingual corpora for Sinhala-Tamil translation. For this procedure, we used a word embedding technique together with a segmentation technique for data preparation. We conducted our experiments with an architecture consisting of a shared encoder and two decoders. Through this approach, we showed that it is feasible to build a translation system using only monolingual corpora.
URI: http://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4172
Appears in Collections: 2019

Files in This Item:
File: 2015 CS 091.pdf
Size: 2.07 MB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.