Dependency Based Grammar Error Detection For Low Resource Languages

Rupesinghe, O V

Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4927

Full metadata record

DC Field	Value	Language
dc.contributor.author	Rupesinghe, O V	-
dc.date.accessioned	2025-08-21T08:25:20Z	-
dc.date.available	2025-08-21T08:25:20Z	-
dc.date.issued	2025-06-29	-
dc.identifier.uri	https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4927	-
dc.description.abstract	This dissertation introduces an end-to-end, data-driven framework for automated grammar error detection (GED) in Sinhala, a low-resource, free-word-order language. We begin by constructing a 400-sentence Universal Dependencies (UD) treebank via a hybrid LLM-and-expert annotation pipeline. Using custom 300-dimensional FastText embeddings, we train a graph-based UUParser with targeted data augmentation and cross-lingual transfer from Hindi. Our final parser achieves an Unlabeled Attachment Score (UAS) of 71.37% and a Labeled Attachment Score (LAS) of 55.42%, and attains sentence-level parse accuracy of 82% on a standard correct corpus, outperforming a leading CFG-based parser (60%) while retaining 64% accuracy on free-word-order variants. Building on this, we generate a synthetic GED corpus of 10,000 sentences covering five error types. We engineer multi-level token features (pretrained word embeddings, POS embeddings, morphological concatenations, dependency relation embeddings, and syntactic n-grams) and train a BiLSTM classifier. The combined model delivers 80% overall classification accuracy. On a 200-sentence standard evaluation set, it correctly classifies 82% of sentences versus 60% for prior CFG-based methods, and it generalizes to free-word-order GED with 64% accuracy. Our contributions include a UD treebank for Sinhala, an optimized dependency parser pipeline, the first dependency-enhanced GED classifier for Sinhala, and a synthetic error corpus. These results confirm that combining hierarchical dependency features with surface-level features significantly boosts GED in challenging low-resource, morphologically rich, free-word-order settings.	en_US
dc.language.iso	en	en_US
dc.title	Dependency Based Grammar Error Detection For Low Resource Languages	en_US
dc.type	Thesis	en_US
Appears in Collections:	2025
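
Note on the metrics reported in the abstract: UAS is conventionally the percentage of tokens whose predicted head matches the gold head, and LAS additionally requires the dependency relation label to match. The following is a minimal sketch of that standard definition, not the thesis's own evaluation script. The function name `attachment_scores`, the `(head, label)` token representation, and the toy data are assumptions made for illustration only.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency relation label).
# Head index 0 denotes the artificial root, as in CoNLL-U files.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over two aligned token lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("no tokens to score")
    # UAS counts tokens whose head is correct, ignoring the label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head and label are both correct.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


# Toy four-token sentence (invented data, not taken from the thesis).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints: UAS=75.00 LAS=50.00
```

In the toy sentence the third token has the right head but the wrong label, so it counts toward UAS only, and the fourth token has the wrong head, so it counts toward neither. This is why LAS can never exceed UAS, consistent with the 71.37% and 55.42% figures in the abstract.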

Files in This Item:

File	Description	Size	Format
20001525 - O V Rupesinghe - oshada rupasinghe.pdf		4.89 MB	Adobe PDF
