Automated News Clustering Using an Unsupervised Learning Model

Fonseka, W. P. I.

Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4529

Title:	Automated News Clustering Using an Unsupervised Learning Model
Authors:	Fonseka, W. P. I.
Issue Date:	11-Aug-2021
Abstract:	Text clustering is an emerging research topic that involves two main subject areas: Natural Language Processing and Machine Learning. Clustering falls under unsupervised machine learning, which is more complex than its supervised variants in terms of both implementation and evaluation. It is especially difficult in a dynamic domain like online news content, because neither the cluster labels nor the number of clusters can be predicted. In the absence of this prior knowledge, the required knowledge has to be acquired from the data set itself. The focus of this research is therefore an approach that addresses these challenges and clusters online news items with a higher degree of accuracy. The main challenge was the large amount of noise caused by the length of news items, so emphasis was placed on minimizing this noise and extracting key features to improve the accuracy and performance of the clustering module. Text preprocessing and feature extraction techniques were evaluated empirically under different parameter combinations to find the method and set of parameters best suited to a noisy, lengthy text corpus like ours. Singular Value Decomposition was used to reduce the dimensionality of the feature matrix, which drastically improved the results generated by the clustering module. The K-means elbow method, together with Silhouette Score curves, was used to find the optimal number of clusters. The optimized model was evaluated using data with ground truth information, and the quality of the generated clusters was verified using extrinsic methods such as the Jaccard Score and the Adjusted Rand Index. The data set for the experiment was generated by programmatically extracting sports news from the CNN, BBC and Al Jazeera news sites.
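The steps described above (feature extraction, dimensionality reduction with Singular Value Decomposition, and a scan over candidate cluster counts using SSE and Silhouette Score) can be illustrated with a minimal sketch. It assumes scikit-learn and TF-IDF features, neither of which is named in the abstract; the function names `reduce_features` and `scan_k`, the parameter values, and the toy corpus are hypothetical and are not taken from the thesis.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def reduce_features(docs, n_components=3, seed=0):
    """TF-IDF features (an assumption), then truncated SVD and L2 normalisation."""
    reducer = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(copy=False),
    )
    return reducer.fit_transform(docs)


def scan_k(features, k_values, seed=0):
    """Return {k: (sse, silhouette)} for each candidate number of clusters."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
        # inertia_ is the within-cluster sum of squared errors (SSE).
        results[k] = (model.inertia_, silhouette_score(features, model.labels_))
    return results


# Hypothetical toy corpus; not drawn from the thesis data set.
docs = [
    "The striker scored a late goal to win the football match",
    "The goalkeeper saved a penalty in the football final",
    "The football club signed a new midfielder before the match",
    "She won the tennis final in straight sets at the tournament",
    "The tennis champion broke serve twice in the second set",
    "A tennis player retired from the tournament with an injury",
    "The batsman hit a century in the cricket test",
    "The bowler took five wickets in the cricket series",
    "Rain delayed the cricket test after the first innings",
]

features = reduce_features(docs)
scores = scan_k(features, range(2, 5))
for k, (sse, sil) in scores.items():
    print(f"k={k}  SSE={sse:.3f}  silhouette={sil:.3f}")
```

Plotting SSE against k gives the elbow curve, and plotting the Silhouette Score against k gives the second curve; a value of k where the SSE curve flattens and the Silhouette Score peaks is a reasonable candidate.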
When the model was tested on the entire data set without any ground truth information, the clusters generated for the identified optimal number of clusters (k) had a Silhouette Score over 0.4 and a Sum of Squared Errors (SSE) below 60. Cluster quality measures did not reach the expected values because some clusters overlapped, though most clusters had a discernible meaning based on the majority of items assigned to them. The model generated high-quality results when tested on a sample data set with non-overlapping clusters: the Silhouette Score was around 0.9 and the SSE fell below 10. The clusters generated for this sample were compared with the ground truth information using extrinsic methods, and the calculated Adjusted Rand Index was 0.9694, which indicates an acceptable end result.
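For readers unfamiliar with the extrinsic measures mentioned above, the following is a minimal standard-library sketch of the Adjusted Rand Index and a pair-counting Jaccard index. The abstract does not specify which Jaccard variant was used, so the pair-counting form is an assumption, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def pair_counts(truth, pred):
    """Count item pairs placed together in both, in truth, and in pred."""
    both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    in_truth = sum(comb(c, 2) for c in Counter(truth).values())
    in_pred = sum(comb(c, 2) for c in Counter(pred).values())
    return both, in_truth, in_pred, comb(len(truth), 2)


def adjusted_rand_index(truth, pred):
    """Rand index corrected for chance; 1.0 means identical partitions."""
    both, in_truth, in_pred, total = pair_counts(truth, pred)
    if total == 0:
        return 1.0  # fewer than two items: partitions trivially agree
    expected = in_truth * in_pred / total
    maximum = (in_truth + in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (both - expected) / (maximum - expected)


def pair_jaccard(truth, pred):
    """Pairs together in both, over pairs together in at least one."""
    both, in_truth, in_pred, _ = pair_counts(truth, pred)
    union = in_truth + in_pred - both
    return both / union if union else 1.0


# Hypothetical labels: the same partition under different label names.
truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]
print(adjusted_rand_index(truth, pred), pair_jaccard(truth, pred))
```

Both measures compare partitions rather than label names, so a clustering that recovers the ground truth groups under different cluster ids still scores 1.0.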
URI:	http://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4529
Appears in Collections:	2020

Files in This Item:

File	Description	Size	Format
2017 MCS 030.pdf		1.18 MB	Adobe PDF
