Learning a wide-coverage generalized classifier model for Sinhala morphology

Shanilka, K.W.S.

Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4641

Title:	Learning a wide-coverage generalized classifier model for Sinhala morphology
Authors:	Shanilka, K.W.S.
Issue Date:	26-Aug-2022
Abstract:	Morphological analysis is of utmost importance in almost all upstream Natural Language Processing (NLP) tasks. Sinhala is the native language of the majority of people in Sri Lanka and is a morphologically rich, agglutinative language. This calls for a robust approach to wide-coverage stem or lemma extraction from free text. Though there have been attempts to perform such morphological analysis using linguist-specified rules, data-driven methods are needed to deal with unforeseen word-form occurrences; hence the need to learn morphology from freely occurring text using computational algorithms. In this study, I explore the use of deep learning algorithms based on the vanilla Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) to learn mappings between word forms and their corresponding stems or roots. Using a previously generated dataset comprising a list of word-root pairs, the models were implemented with the Keras open-source library on its TensorFlow backend to experiment with the automatic discovery of these mappings. The learnt models were also tested on previously unseen Sinhala words. Several experiments were carried out using different feature representations, deep learning architectures and loss functions. An accuracy of just under 72% was obtained for the Simple RNN model, while accuracies of around 75% were obtained for the LSTM and GRU models. These results indicate that GRU and LSTM models can serve as suitable computational models for wide-coverage lemmatizers for the Sinhala language. An error analysis was performed on the incorrect lemmatizations generated during the experiments, and the errors were categorized into five types. Of these, one-third of the errors were due to the mapped lemma being the same as the original word, while a further significant proportion arose where the predicted word is not a substring of the original word. Future work should focus on addressing these two types of errors and obtaining further optimized results.
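The error analysis described in the abstract can be illustrated with a small sketch. It is not the thesis's code: the function names, the category labels and the toy triples are assumptions made for illustration only. The abstract names just two of its five error types, so every other error falls into a catch-all "other" bucket here.

```python
from collections import Counter


def categorize(word: str, gold: str, predicted: str) -> str:
    """Assign one prediction to a coarse category (labels are hypothetical)."""
    if predicted == gold:
        return "correct"
    if predicted == word:
        # The model returned the input word form unchanged.
        return "unchanged"
    if predicted not in word:
        # The output is not a contiguous substring of the input word.
        return "not_substring"
    # The abstract does not name its remaining error types.
    return "other"


def summarize(rows):
    """Return (accuracy, category counts) for (word, gold, predicted) triples."""
    counts = Counter(categorize(w, g, p) for w, g, p in rows)
    total = sum(counts.values())
    accuracy = counts["correct"] / total if total else 0.0
    return accuracy, counts


# Toy Latin-script triples standing in for Sinhala word-root pairs.
SAMPLE = [
    ("walking", "walk", "walk"),
    ("cats", "cat", "cats"),
    ("running", "run", "ran"),
    ("talked", "talk", "talke"),
]

if __name__ == "__main__":
    accuracy, counts = summarize(SAMPLE)
    print(f"accuracy={accuracy:.2f}", dict(counts))
```

Running `summarize` over a model's predictions on held-out words gives the overall accuracy together with the share of errors in each bucket, which is the kind of breakdown the abstract reports.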
URI:	https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4641
Appears in Collections:	2021

Files in This Item:

File	Description	Size	Format
2018 MCS 084.pdf		542.14 kB	Adobe PDF
