Topic Modelling on Natural Language Corpus Data

Mahath, T M I

Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/3725

Title:	Topic Modelling on Natural Language Corpus Data
Authors:	Mahath, T M I
Issue Date:	15-Sep-2016
Abstract:	Social networks have become a dominant information source where a wide variety of users engage in discussions about public postings and shared content. The volume of information grows daily, and there is not enough human capacity to read all of it. For our research we chose Twitter, a source of natural language data that is diverse in subject matter and contains unstructured information. Because tweets are restricted in length, standard topic modeling tools fail to reach their full potential on them. The research aims to discover how topic modeling can be performed on short texts and to identify the important topics that occur in a given period. As a methodology, we used extended topic models based on Latent Dirichlet Allocation (LDA) to explore the hidden structure of tweets. Focusing on LDA, we also discuss how inference methods can be used to model the information. We used Apache Spark's LDA implementation to perform our topic modeling tasks. We evaluate our results using different data schemes, compare their quality and accuracy, and discuss how to train and test LDA on a Twitter corpus. We show that training a topic model on aggregated short messages yields a better classification of a Twitter data corpus. Our study also shows the relations between words and how they evolve over time.
URI:	http://hdl.handle.net/123456789/3725
Appears in Collections:	Master of Computer Science - 2016
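
The abstract reports that training on aggregated short messages gave better results than training on individual tweets. The record does not say how the messages were grouped, so the following is only a minimal sketch of one possible aggregation step. The grouping key (the posting user), the `tokenize` and `pool_messages` helpers, and the sample tweets are all hypothetical and are not taken from the thesis.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase, keep alphabetic tokens and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase a message and return its word tokens."""
    return TOKEN_RE.findall(text.lower())


def pool_messages(messages, key):
    """Merge short messages that share a key into one longer pseudo-document.

    `key` is any function mapping a message to a grouping value
    (assumed here to be the user; the thesis may have grouped differently).
    """
    pooled = defaultdict(list)
    for msg in messages:
        pooled[key(msg)].extend(tokenize(msg["text"]))
    return dict(pooled)


# Made-up sample data, for illustration only.
tweets = [
    {"user": "a", "text": "Election results tonight"},
    {"user": "a", "text": "Votes are being counted"},
    {"user": "b", "text": "Cricket match delayed by rain"},
]

docs = pool_messages(tweets, key=lambda m: m["user"])
for user, tokens in docs.items():
    print(user, tokens)
```

Each pooled token list could then be converted to term counts and passed to an LDA implementation such as the Apache Spark one named in the abstract. That step is omitted here because the record gives no details of the configuration used.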

Files in This Item:

File	Description	Size	Format
Thesis_13440471.pdf (Restricted Access)		2.07 MB	Adobe PDF
