A Multi-Class Classification Approach for Native Language Identification of English Writers using Machine Learning and Natural Language Processing Techniques

Perera, D. N.

Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4459

Title:	A Multi-Class Classification Approach for Native Language Identification of English Writers using Machine Learning and Natural Language Processing Techniques
Authors:	Perera, D. N.
Issue Date:	5-Aug-2021
Abstract:	This research addresses a multi-class text classification problem using supervised learning. Its main focus is to identify the native language of writers of English from their written English text, using machine learning (ML) and natural language processing (NLP) techniques. According to the literature, the most promising ML algorithms for text classification are Support Vector Machines (SVMs), Naïve Bayes, and Decision Trees. The main data corpora used for this research were the Italki corpus and the ICE corpus. Each corpus was divided in two: 70% of the data (seen data) was used for model building and training, and 30% of the data (unseen data) for testing and validation. Data were pre-processed to remove URLs, XML tags, multiple spaces, e-mail addresses, censored data, etc. Testing was conducted using Linear Support Vector Machines (LSVM) with stochastic gradient descent (SGD), Multinomial Naïve Bayes (MNB), Decision Tree (DT), and an ensemble approach. 5-fold cross-validation was used on the training data to fine-tune the hyperparameters of the classifier. Further, learning curves were drawn to check whether the ML model was overfitting. The ML model was then tested on the unseen data. Three approaches were trained and tested: flat classification, hierarchical classification, and sub-category classification. Hierarchical classification achieved much higher performance than the flat classification approach. For the flat classification approach, accuracies of 81.80% for the ICE corpus and 42.50% for the Italki corpus were obtained. For the hierarchical classification approach, 50.85% accuracy was achieved for the Italki corpus, and 63.52% and 68.93% were achieved for Kachru’s model and the geographical distribution model, respectively. The sub-category classification approach achieved its highest accuracy of 82.32% for the Non-Native Non Indo European Austronesian category in the Italki corpus and 100% accuracy for the Expanded circle category in the ICE corpus. When the performance of each machine learning technique was analysed, LSVM-SGD and MNB gave promising results compared to DT and RF. In addition, significant features of the sub-categories, especially for Sinhala in the Italki corpus and Sri Lanka in the ICE corpus, were derived and analysed.
URI:	http://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4459
Appears in Collections:	2020
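
The workflow outlined in the abstract (cleaning, a 70/30 split, and 5-fold cross-validation to tune a linear SVM trained with SGD) could look roughly like the sketch below. It is an illustration under stated assumptions, not the thesis code: scikit-learn, the TF-IDF features, the regular expressions, the hyperparameter grid, and the names clean_text and fit_nli_model are all hypothetical choices made here. Only the LSVM-SGD classifier is shown; MNB, DT, and the ensemble are omitted.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumed cleaning rules; the abstract names the noise types, not the patterns.
TAG_RE = re.compile(r"<[^>]+>")                # XML/HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # URLs
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")         # e-mail addresses
SPACE_RE = re.compile(r"\s+")                  # runs of whitespace


def clean_text(text):
    """Strip tags, URLs and e-mail addresses, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def fit_nli_model(texts, labels):
    """Tune a linear SVM (hinge loss, SGD) and score it on held-out data.

    Returns the fitted GridSearchCV object and the accuracy on the 30% split.
    """
    cleaned = [clean_text(t) for t in texts]

    # 70% seen data for training and tuning, 30% unseen data for testing.
    x_train, x_test, y_train, y_test = train_test_split(
        cleaned, labels, test_size=0.30, stratify=labels, random_state=0
    )

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),  # assumed feature representation
        ("clf", SGDClassifier(loss="hinge", random_state=0)),
    ])

    # Hypothetical grid; the abstract does not list the tuned hyperparameters.
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__alpha": [1e-4, 1e-3],
    }

    # 5-fold cross-validation on the training data only.
    search = GridSearchCV(pipeline, param_grid, cv=5)
    search.fit(x_train, y_train)

    return search, search.score(x_test, y_test)
```

Calling fit_nli_model with a list of documents and a parallel list of native-language labels returns the tuned model and its held-out accuracy. Every class needs at least five training documents for the stratified 5-fold split to work.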

Files in This Item:

File	Description	Size	Format
2016 MCS 076.pdf		18.99 MB	Adobe PDF
