Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/3733
Title: Topic modelling for extracting dominant themes from diverse news sources
Authors: Gunawardena, V.L.
Issue Date: 15-Sep-2016
Abstract: The contemporary media landscape overloads people with massive volumes of online news, making it difficult for them to recognize the underlying dominant themes. The goal of this research is to propose a methodology for extracting the dominant themes from news articles available on Sri Lankan English news sites. This research employs Latent Dirichlet Allocation (LDA), a popular unsupervised statistical topic modelling technique that provides a convenient methodology for analyzing an unlabeled collection of text. This thesis contains an in-depth analysis of the selection of an appropriate model for topic modelling. Because the selected technique, LDA, is conditioned upon the Dirichlet hyperparameters and the number of topics, further analysis and experimentation were undertaken to determine those parameter values. The Dirichlet hyperparameters were determined based upon research evidence, while an empirical approach was followed in determining the number of topics: the Dirichlet hyperparameters were fixed and the consequences of varying the number of topics were explored. Instead of relying entirely upon the log-likelihood values obtained through experimentation, manual inspection ("eyeballing") of the topics was also used in determining the most appropriate number of topics. The generated LDA model was then used to infer topic probability distributions for unseen news texts. The test results were evaluated using an accuracy measure computed from human evaluation, with the intention of assessing the human-identifiable semantic coherence of the discovered topics and the topic assignments. Analysis of the test results revealed that the extracted topics successfully uncovered the semantic structure in the data and are consistent with the class designations provided by the users to an acceptable level of accuracy.
URI: http://hdl.handle.net/123456789/3733
Appears in Collections: Master of Computer Science - 2016

Files in This Item:
File: 13440242_Project Thesis.pdf (Restricted Access)
Size: 1.15 MB
Format: Adobe PDF
Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.