Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/3673
Title: An Approach to Word Sense Induction for Sinhala Language
Authors: Dissanayake, M. Lishani Sadna
Keywords: word sense disambiguation
Sinhala Language
Word sense induction
Semantics
Issue Date: 8-Sep-2016
Abstract: Semantic processing plays a central role in Natural Language Processing (NLP), since it underpins many NLP fields and applications. Among semantic processing tasks, word sense induction and disambiguation are fundamental. Our research is the first attempt to induce word meanings from raw Sinhala text. We focus on word sense induction, which automatically attempts to determine the correct sense of an ambiguous word, i.e. a word with multiple meanings. As a first approach, we extract meanings from sentences not only for ambiguous words but for all Sinhala words. Many word sense induction and disambiguation mechanisms exist for well-resourced languages such as English because of the availability of lexical resources such as WordNet. For Sinhala, however, due to the lack of such resources, we cannot directly apply those word sense disambiguation methodologies, and building such a network of senses manually is highly time-consuming. Identifying the correct sense of an ambiguous word requires a sense inventory, so we use a word sense induction procedure to build one: with the help of an unsupervised learning algorithm, we find the relevant senses for a given word. Using this Sinhala sense inventory, we can create a resource for determining word senses from raw text. As one unsupervised method we use word2vec, an existing neural-network-based algorithm trained using the continuous bag-of-words and skip-gram architectures. Our other algorithm is built on a powerful property of human language:
One sense per collocation: adjacent or neighboring words provide consistent clues to the sense of a given word, conditional on relative distance, order, and syntactic relationship. To determine the correct sense of an ambiguous word in a text, we first create a vector representation of that word. The resulting vector space model is two-dimensional: it records, for each Sinhala word, the counts of other Sinhala words occurring nearby. Initially the distance between words is one, but it can later be increased; with a larger window size, the number of observed co-occurrences grows, so we can expect more reliable results. The idea is to identify the correct sense of an ambiguous word by looking at its context and consulting the vector space we are building to decide which sense fits best. To implement this vector space model, the corpus words are placed into a two-dimensional matrix, with all words along both the x-axis and the y-axis, and we count how often each pair of words occurs together within the given distance. Before this, the corpus was pre-processed: since all words are taken into account, each word is replaced by its stem. The other model we used is Google's already implemented word2vec algorithm. From our raw corpus of 450,000 words, we first built a vector space model for each word in the corpus. Since we cannot evaluate every word in the model, we took the first 1,000 words to check whether the senses or meanings returned by our model are reasonable. We also obtained senses and meanings from the word2vec model and compared them with our vector space model results.
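The co-occurrence matrix described above can be sketched as follows. This is a minimal illustration, not the thesis's code: toy English tokens stand in for the stemmed Sinhala corpus, and a nested dictionary stands in for the word-by-word matrix.

```python
from collections import defaultdict

def build_cooccurrence(tokens, window=1):
    """Count, for each word, how often every other word occurs
    within `window` positions of it (the 2-D word-by-word matrix)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the word itself
                counts[word][tokens[j]] += 1
    return counts

# Toy stand-in for a stemmed token stream.
tokens = ["bank", "river", "bank", "money", "bank", "river"]
m = build_cooccurrence(tokens, window=1)
print(m["bank"]["river"])  # -> 3
```

Increasing `window` beyond 1 corresponds to the abstract's idea of widening the distance so that more co-occurrences are captured.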
To evaluate, we ran 40 words through both models and compared the outputs. These 40 words were categorized from our vector space model results: the first 20 yielded correct senses or meanings, while the other 20 did not. Feeding the same 40 words into the word2vec model, we observed that it did not give very good results compared with our vector space model. Finally, these outputs can be given to linguists for their opinion. From the results and evaluation we conclude that our word vector model gives more meaningful senses than the word2vec model, although we cannot yet conclude that one model is definitively better, since we tested only a small sample of the corpus. For training these models, all the words in the raw corpus were considered, which suggests the results should generalize. We therefore conclude that word senses for the Sinhala language can be induced using word vectors built from adjacent co-occurring words.
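The sense-selection idea running through the abstract — choosing, for an ambiguous word, the candidate sense whose vector best matches the surrounding context — can be illustrated with cosine similarity over sparse count vectors. This is an illustrative sketch, not the thesis's exact procedure; the sense labels and counts below are invented toy data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def pick_sense(context_vec, sense_vecs):
    """Return the sense label whose vector is most similar to the context."""
    return max(sense_vecs, key=lambda s: cosine(context_vec, sense_vecs[s]))

# Hypothetical sense inventory for an ambiguous word.
senses = {
    "bank/finance": {"money": 5, "loan": 3},
    "bank/river": {"water": 4, "shore": 2},
}
print(pick_sense({"money": 1, "loan": 1}, senses))  # -> bank/finance
```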
URI: http://hdl.handle.net/123456789/3673
Appears in Collections:SCS Individual Project - Final Thesis (2015)

Files in This Item:
File: 11001412_Thesis_Final.pdf (Restricted Access)
Size: 2.07 MB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.