Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4582
Title: Detecting Intrinsic Plagiarism Using Text Analytics
Authors: Abeykoon, B. B. D. S.
Issue Date: 23-May-2022
Abstract: With wide access to information and services, verifying the originality of written work has become a serious problem to which universities and other organizations must pay increasing attention. Plagiarism is the term used to categorize non-original content passed off as one's own authentic contribution. Several plagiarism detection services rely on massive archives of existing written work against which a submitted work is compared. These services can be quite expensive, and there is also a need to detect plagiarism without access to such archives. This is known as intrinsic plagiarism detection: a document is analysed to identify anomalies in its own overall writing style. This study focuses on intrinsic plagiarism detection and aims to learn significant features that help a machine learning algorithm detect anomalous sections in a given document. Single-author documents were selected from three broad domains: global warming, civics and health, and rubber plantation. After initial preprocessing, paragraphs written by other authors on the same domain were added to simulate the intrinsic plagiarism scenario. The result was an imbalanced dataset, on which a model was built using stylistic features. The One-Class SVM algorithm was used for classification, with ‘Author’ and ‘Non-author’ as the class labels. Lexical features and POS tags were extracted from the text, and the ten best features were selected from among them. The model was implemented with all of the features and with the ten best features, and the two were compared. Performance was measured by validation accuracy, F1 score, precision, and recall. In addition, the accuracies were compared with those of Naive Bayes, SVM, and Logistic Regression classifiers using bag-of-words representations at the character level, word level, tf-idf level, and n-gram level.
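As a rough illustration of the lexical feature extraction step described above, the following minimal sketch computes a few per-paragraph style measures using only the Python standard library. The function name `lexical_features` and the four features shown are assumptions chosen for illustration; the abstract does not list the features actually used, and POS-tag features are omitted here because they would require an external tagger.

```python
import re
import string


def lexical_features(paragraph: str) -> dict:
    """Compute a few simple style features for one paragraph.

    Hypothetical feature set for illustration only; the thesis's
    actual ten best features are not given in the abstract.
    """
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept inside words).
    words = re.findall(r"[A-Za-z']+", paragraph.lower())
    n_words = len(words) or 1          # avoid division by zero
    n_chars = len(paragraph) or 1
    n_punct = sum(ch in string.punctuation for ch in paragraph)
    return {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        "avg_sentence_len": len(words) / (len(sentences) or 1),
        "punct_ratio": n_punct / n_chars,
    }


if __name__ == "__main__":
    print(lexical_features("The cat sat. The cat ran!"))
```

Each paragraph of a document would be mapped to one such feature vector, so that paragraphs whose vectors differ markedly from the rest can be treated as candidates for the ‘Non-author’ class.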
The final results were evaluated, and the validation accuracy of the One-Class SVM classifier built with the ten best stylistic features was 51.18%. Hence, the ten best features selected have a significant impact on the accuracy of the model. In the bag-of-words setting, the Naive Bayes classifier achieved its highest validation accuracy, 94.87%, with count vectors. The Logistic Regression classifier also achieved its highest validation accuracy, 93.59%, with count vectors, as did the SVM classifier, with 88.89%. Although the model's accuracy is below expectations, further improvements can be expected with more data and the application of newer deep learning models.
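The one-class setup summarized in the abstract can be sketched as follows, assuming scikit-learn and NumPy are available. This is not the thesis's implementation: the feature values are synthetic, and the helper names `fit_author_model` and `flag_paragraphs`, the `nu` value, and the RBF kernel are all assumptions made for illustration. The model is fitted on the author's paragraphs only and then labels new paragraphs as +1 (‘Author’) or -1 (‘Non-author’).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def fit_author_model(author_features, nu=0.1):
    """Fit a One-Class SVM on the single author's feature vectors."""
    scaler = StandardScaler().fit(author_features)
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
    model.fit(scaler.transform(author_features))
    return scaler, model


def flag_paragraphs(scaler, model, features):
    """Return +1 ('Author') or -1 ('Non-author') per paragraph."""
    return model.predict(scaler.transform(features))


# Synthetic stand-in for stylistic features (not real data):
# columns could be word length, type-token ratio, sentence length,
# and punctuation ratio.
rng = np.random.default_rng(0)
author = rng.normal(
    loc=[4.5, 0.6, 18.0, 0.03],
    scale=[0.2, 0.05, 2.0, 0.005],
    size=(60, 4),
)
# Simulated inserted paragraphs: shifted far from the author's style.
foreign = author[:5] + np.array([2.0, -0.5, 20.0, 0.05])

scaler, model = fit_author_model(author)
flags = flag_paragraphs(scaler, model, foreign)
print(flags)
```

In this toy setting the shifted paragraphs lie far outside the author's feature distribution, so they are easy to separate; real inserted paragraphs on the same domain are much closer in style, which is consistent with the modest accuracy reported above.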
URI: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4582
Appears in Collections: 2021

Files in This Item:
File               Description    Size       Format
2018 BA 001.pdf                   3.69 MB    Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.