Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/3924
Title: Identification of Hate Speech in Social Media
Authors: Ruwandika, N.D.T.
Issue Date: 2017
Abstract: Hate speech on social media is a common and rapidly growing problem. The growth of online hate content strongly influences the rise of hate crimes in society, so an accurate and efficient method for controlling online hate content would be of great benefit. This research compares different techniques for identifying hate speech in a local English dataset. A new dataset was created from comments published on a Sri Lankan news site. It contains 1500 comments, of which 1000 are annotated: 421 as hate speech and 579 as free of hate speech. We evaluated and compared different classifiers with different features, and also investigated how model accuracy changes as the size of the dataset increases. Five machine learning models were implemented. Because the current literature frames this task as supervised learning, an unsupervised model was also included among the five. The models were built with a Support Vector Machine, Logistic Regression, Naïve Bayes, a Decision Tree and K-Means clustering. Bag of words, TF-IDF and two further feature types were used as features, and the Google bad word list served as the hate lexicon for extracting features from the data. The Naïve Bayes classifier with TF-IDF features was the best-performing model, with an F-score of 0.719. In almost all the scenarios considered, the supervised models performed better than the unsupervised model. Since the amount of local data available for experiments is very low, extending the current dataset is suggested as future work. Combining different feature types and re-evaluating the classifier models can also be done. It is also important to create a lexicon of the English words used in Sri Lanka.
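As a rough illustration of two ideas mentioned in the abstract, lexicon-based feature extraction and the F-score, the following is a minimal sketch and not the thesis implementation. The names `PLACEHOLDER_LEXICON`, `lexicon_features` and `f_score` are hypothetical, the lexicon entries are stand-ins rather than the Google bad word list, and the sketch assumes the F-score is the F1 of the hate class, which the abstract does not specify.

```python
import re

# Hypothetical stand-in lexicon; the thesis used the Google bad word list,
# whose contents are not reproduced here.
PLACEHOLDER_LEXICON = {"badword", "slur"}


def lexicon_features(comment, lexicon=PLACEHOLDER_LEXICON):
    """Return (lexicon hits, token count, hit ratio) for one comment."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    hits = sum(1 for token in tokens if token in lexicon)
    ratio = hits / len(tokens) if tokens else 0.0
    return hits, len(tokens), ratio


def f_score(y_true, y_pred, positive=1):
    """F1 of the positive class (assumed here to be the hate class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Features like these could be fed to any of the classifiers named in the abstract; which exact features the thesis derived from the lexicon is not stated in this record.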
URI: http://hdl.handle.net/123456789/3924
Appears in Collections: SCS Individual/Group Project - Final Thesis (2017)

Files in This Item:
File: 13001051_Thesis.pdf
Size: 1.16 MB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.