Improving Performance of Statistical Machine Translation between Morphologically Rich and Low Resourced Language Pairs

Pushpananda, B. H. R.

Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4770

Title:	Improving Performance of Statistical Machine Translation between Morphologically Rich and Low Resourced Language Pairs
Authors:	Pushpananda, B. H. R.
Issue Date:	2017
Abstract:	Statistical and machine learning methods have trumped rule-based approaches to machine translation (MT). However, most language pairs still lack acceptable MT systems because adequate amounts of parallel data are unavailable. In addition, morphological richness is a confounding factor that must be addressed in developing a successful MT system, especially for agglutinative languages. When one or both of the languages to be translated are both data-scarce and morphologically rich, developing a good MT system for them is very difficult. In this research, we investigated several approaches and techniques for translating from and into languages that are both morphologically rich and low-resourced, namely between Sinhala and Tamil. According to the literature, integrating morphological information into the MT process is one way to improve MT between morphologically rich language pairs. However, the unavailability of language resources such as morphological analyzers and part-of-speech taggers hampers progress on translation between such language pairs. Since this affects both Sinhala and Tamil, we integrated unsupervised learning and transliteration approaches into a phrase-based statistical MT approach in an attempt to overcome these issues. Initially, we performed three sets of experiments employing different morphological representations, namely “fully-morpheme-like”, “semi-morpheme-like with merged suffixes” and “semi-morpheme-like without merged suffixes”. Of these, the fully-morpheme-like and the semi-morpheme-like without merged suffixes approaches showed improvements of approximately 2.9 and 1.7 BLEU points respectively over the baseline phrase-based approach. We were also able to reduce the out-of-vocabulary rate from 25% to 1% using these techniques.
Using a purely transliteration-based approach, we were only able to achieve a 0.44 BLEU point increase over the baseline approach. We then combined the fully-morpheme-like approach with the transliteration (direct-mapping) approach to build a “joint” model, which achieved a 4.69 BLEU point improvement over the baseline and a 1.83 BLEU point improvement over the fully-morpheme-like approach. We also carried out a critical appraisal of the kernel-based MT (KBMT) framework, which had shown some promise in early Sinhala-Tamil (SI-TA) translation work. We explored its performance on a large French-English (FR-EN) corpus and on our much smaller SI-TA corpus. We introduced two novel approaches that filter the training examples based on the input using cosine similarity values. Overall, the KBMT approach gives lower-quality translations than the other approaches we tried in this research for the morphologically rich language pair Sinhala and Tamil. Finally, we compared the results obtained from our kernel-based translator, our phrase-based baseline, the baseline augmented by our fully-morpheme-like segmentation approach, and our “joint” model with results obtained from Google Translator for the SI-TA language pair. Both manual and automatic evaluation techniques were used to measure the quality of the resulting translations. In the automatic evaluation, Google Translator gave the lowest BLEU score of all the models compared. However, expert manual evaluation shows that the quality of the translations output by Google Translator is somewhat closer to that of our baseline approach. Overall, our unsupervised fully-morpheme-like segmentation approach in its “joint” form achieved the best results in both automatic and manual evaluations and is significantly better than Google Translator for the Sinhala-Tamil language pair.
URI:	https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4770
Appears in Collections:	2017

Files in This Item:

File	Description	Size	Format
PhD_BHR Pushpananda2017.pdf		10.4 MB	Adobe PDF
