Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/2483
Title: Towards A Shallow Parser for Tamil
Authors: Ariaratnam, I.
Issue Date: 20-May-2014
Abstract: Natural Language Processing is concerned with the computational aspects of human languages, with the aim of developing computational models that analyze and understand natural language sentences. Syntax is the level of linguistics that deals with the structural representation of human languages. Natural language sentences are structurally rich and often ambiguous, so analyzing syntactic structure remains a challenging task despite extensive research. Shallow parsing addresses this problem by assigning a partial structure to a sentence in order to recover a limited amount of syntactic information. It can be described as a combination of two steps: Part of Speech (POS) tagging, the process of automatically labeling each word in a sentence with its part of speech (such as noun, verb, adjective or adverb), and Chunking, the division of a text's sentences into syntactically correlated word groups such as noun phrases, verb phrases and prepositional phrases. It doesn't attempt to resolve all semantically significant ambiguities; instead, it identifies the phrases in each sentence. Importantly, shallow parsing is sufficient for many natural language processing applications, such as information extraction, information retrieval and machine translation. In this thesis, we present a POS tagger based on a machine learning approach, the Maximum Entropy model, and a rule-based Chunker for Sri Lankan Tamil, without the need for a large training corpus. To this end, we created our own POS tagset, a manually POS-annotated corpus of approximately 12500 words collected from newspapers, a list of handcrafted linguistic rules for extracting phrase-level structures, and a manually chunked version of the test data for evaluation.
Our POS tagger achieved a promising accuracy of 81.72% for Sri Lankan Tamil with the best-performing configuration: contextual features (the previous two words, the next word and the POS tag of the previous word), prefix length = 4, suffix length = 6, and the addition of a tag dictionary, digit features and symbol features. Our Chunker achieved a recall of 79.0% and a precision of 77.6%. The combination of the POS tagger and the Chunker, known as the Shallow Parser, chunks 78.9% of the words. Although the Chunker on its own reaches an F-measure of 78.3%, the Shallow Parser obtains an F-measure of 66.6%, since errors from the POS tagger propagate to the Chunker.
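The feature configuration and the F-measure reported in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, feature keys and boundary markers (`<BOS>`, `<EOS>`) are hypothetical, the tag dictionary is omitted, and two readings are assumed, namely that "prefix length = 4" and "suffix length = 6" mean all affixes up to that length, counted in Unicode code points, and that a "symbol" is any punctuation or symbol character.

```python
import unicodedata
from typing import Dict, List

PREFIX_LEN = 4  # prefix length reported in the abstract
SUFFIX_LEN = 6  # suffix length reported in the abstract


def token_features(words: List[str], i: int, prev_tag: str) -> Dict[str, object]:
    """Build a feature dict for words[i]; all feature names are hypothetical."""
    word = words[i]
    feats: Dict[str, object] = {
        "word": word,
        # Contextual features: previous two words, next word, previous POS tag.
        "prev_word_1": words[i - 1] if i >= 1 else "<BOS>",
        "prev_word_2": words[i - 2] if i >= 2 else "<BOS>",
        "next_word": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "prev_tag": prev_tag,
        # Digit feature: does the token contain any digit?
        "has_digit": any(ch.isdigit() for ch in word),
        # Symbol feature: Unicode punctuation (P*) or symbol (S*) categories.
        # Tamil vowel signs are marks (M*), so they are not counted as symbols.
        "has_symbol": any(unicodedata.category(ch)[0] in "PS" for ch in word),
    }
    # Assumption: emit every prefix/suffix up to the configured length.
    for n in range(1, min(PREFIX_LEN, len(word)) + 1):
        feats[f"prefix_{n}"] = word[:n]
    for n in range(1, min(SUFFIX_LEN, len(word)) + 1):
        feats[f"suffix_{n}"] = word[-n:]
    return feats


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Chunker figures from the abstract: precision 77.6%, recall 79.0%.
    print(round(f_measure(77.6, 79.0), 1))
```

Feature dicts of this shape could be passed to any maximum entropy (multinomial logistic regression) learner. The F-measure helper shows how the Chunker's reported precision and recall combine into the reported 78.3%.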
URI: http://hdl.handle.net/123456789/2483
Appears in Collections: SCS Individual Project - Final Thesis (2013)

Files in This Item:
File: 9001727.pdf (Restricted Access)
Size: 1.76 MB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.