Identifying the ethno-nationality of English bloggers using deep learning

Mendis, B. S. G

Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4399

Title:	Identifying the ethno-nationality of English bloggers using deep learning
Authors:	Mendis, B. S. G
Issue Date:	3-Aug-2021
Abstract:	English is one of the most widely used languages in the world, but it is not the native language of the majority of its users. Non-native writers therefore give away certain cues, such as grammar, spelling, sentence structure, and frequently used phrases, that separate them from native English speakers. Previous research in the area has focused on finding these cues, also known as stylistic features, that distinguish native from non-native English authors. These features were later used to identify the native language of non-native English authors. However, none of that research addresses the localization of the English language, which occurs when a set of common stylistic features unique to a specific locale is absorbed into English. The corpora used were learner corpora, in which authors are highly dependent on their native language. Therefore, stylistic features that distinguish authors of different ethno-nationalities who share the same native language cannot be identified from them. This research was done to uniquely identify the ethno-nationality of random authors with varied knowledge of English, based on their stylistic features, which can be used to understand the localization of the English language. The problem was addressed as a machine learning problem. Supervised learning algorithms such as Support Vector Machines (SVM), decision trees, and random forests have been used before in similar work in the area. However, neural networks have rarely been used for this type of study. Neural networks are able to learn and model non-linear and complex relationships. They can infer relationships in unseen data, which helps the model generalize and predict. This would be helpful in discovering new stylistic features in written text. Out of the many neural network models, a Long Short-Term Memory (LSTM) model was chosen because it preserves the error that is backpropagated through time and layers.
There is no straightforward method to extract the stylistic features that were identified or memorized by the LSTM model. Therefore, as part of the research, a method based on the weights of the cells in the LSTM model was developed to extract the stylistic features. The research produced a model that is able to predict the ethno-nationality of an anonymous blogger with an accuracy of 62.9%. In addition, sets of stylistic features of native authors, non-native authors, South Asian authors, and Sri Lankan authors were identified. These stylistic features can be used by linguists to further study the usage of the English language.
URI:	http://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4399
Appears in Collections:	2019

Files in This Item:

File	Description	Size	Format
2016MCS060.pdf		1.02 MB	Adobe PDF
