Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4595
Title: Generating Contextual Word Embeddings for Sinhala
Authors: Silva, J. A. S. N.
Issue Date: 10-Jun-2022
Abstract: Word embeddings are an important part of Natural Language Processing (NLP): they aim to find a better representation of a word. One application of word embeddings is predicting the next word of a given phrase. Word embedding techniques are mostly applied to high-resource languages and are rarely developed for low-resource languages. Sinhala can be considered a low-resource language. This study therefore addresses the problem of predicting the next word of a given phrase in Sinhala. The data for the study were collected from the web and preprocessed to remove noise. The preprocessed data were fed into three types of models: an n-gram model, an LSTM model (which builds word embeddings without taking the context-dependent meaning of a word into account, so the true meaning of a sentence may be lost), and a RoBERTa model (which learns the meaning of a word from its context). The best model was selected by comparing perplexity values. Among the n-gram models, the tri-gram model obtained the lowest perplexity, 502.34. Among the LSTM models, the model that considered the four previous words for prediction performed best (perplexity: 44.53). For the BERT models, the results indicate that the distilBERT model performs better than the BERTbase model; the lowest perplexity for the distilBERT model was 3.19, at epoch 30. Overall, it can be concluded that the RoBERTa model is better suited to language modelling than the n-gram and LSTM models.
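The abstract compares models by perplexity without stating how it is computed. As a minimal sketch, assuming the standard definition (the exponential of the average negative log-probability per token), the Python below shows the calculation alongside a toy tri-gram model. The function names, the add-one smoothing, and the placeholder tokens are illustrative assumptions and are not taken from the thesis.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """Perplexity: exp of the average negative log-probability per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


def train_trigram(tokens):
    """Count tri-grams and their two-word contexts in a list of tokens."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    contexts = Counter()
    for (w1, w2, _w3), count in trigrams.items():
        contexts[(w1, w2)] += count
    return trigrams, contexts, set(tokens)


def trigram_prob(w1, w2, w3, trigrams, contexts, vocab_size):
    """P(w3 | w1, w2) with add-one smoothing (an assumption for this sketch)."""
    return (trigrams[(w1, w2, w3)] + 1) / (contexts[(w1, w2)] + vocab_size)


# Placeholder tokens; a real experiment would use tokenised Sinhala text.
tokens = "a b c a b c a b d".split()
trigrams, contexts, vocab = train_trigram(tokens)

# Score every tri-gram position of the (toy) evaluation sequence.
probs = [
    trigram_prob(w1, w2, w3, trigrams, contexts, len(vocab))
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:])
]
print(perplexity(probs))
```

A lower value means the model assigns higher probability to the observed text. As a sanity check, a model that assigns a uniform probability of 1/4 to every token has a perplexity of 4.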
URI: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4595
Appears in Collections: 2021

Files in This Item:
File: 2018 BA 032.pdf
Size: 1.23 MB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.