Vision Based Approach for Sinhala, Tamil and English Script Identification

Jayasundara, N.N.

Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/3727

Title:	Vision Based Approach for Sinhala, Tamil and English Script Identification
Authors:	Jayasundara, N.N.
Issue Date:	15-Sep-2016
Abstract:	Optical character recognition (OCR) is an important task for converting handwritten and printed documents to digital format. OCR has already been very successful at recognizing documents written in a single language. In a multi-script environment, however, the majority of documents contain more than one script or language. To process such documents through OCR, it is necessary to identify the different script regions of the document. Most existing approaches to this problem are based on either global or local methods, and each targets documents containing a different combination of languages. This project aims to use a combination of local and global methods, focusing mainly on texture-based features, to identify the Sinhala, Tamil and English scripts in a document image at the word level. A solution based on the Bag of Features (BoF) algorithm is proposed. The solution first obtains the set of words in the document image; the BoF algorithm is then applied with each of the compared feature extractors to classify the words as Sinhala, English or Tamil. A novel algorithm is proposed to segment the words of a document containing these languages; it segments words at a rate of sixty-five words per second. A novel approach is also proposed to create the customized features that are fed into BoF. In the absence of a standard database, two locally compiled datasets of different sizes are used. After analysis, the SURF feature based method is selected as the best approach. Regardless of word length, font style and font size, the proposed solution achieves an accuracy of 93%. Based on the evaluation results, future work is suggested to improve the performance of the preprocessing stage for document images and the recognition accuracy of BoF classification.
URI:	http://hdl.handle.net/123456789/3727
Appears in Collections:	Master of Computer Science - 2016
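
The Bag of Features pipeline described in the abstract (local descriptors, a clustered visual vocabulary, one histogram per word image, then classification) can be illustrated with a minimal sketch. This is not the thesis implementation: raw 2x2 pixel patches stand in for SURF descriptors, a small hand-written k-means builds the vocabulary, and a nearest-histogram rule stands in for whatever classifier the thesis uses, which the abstract does not specify. All function names, the toy images, and the labels script_a and script_b are hypothetical.

```python
import random

def patch_descriptors(image, size=2):
    """Flatten every non-overlapping size x size patch into a tuple.
    Assumption: raw patches replace SURF descriptors for illustration."""
    h, w = len(image), len(image[0])
    return [
        tuple(image[r + dr][c + dc] for dr in range(size) for dc in range(size))
        for r in range(0, h - size + 1, size)
        for c in range(0, w - size + 1, size)
    ]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to point (lowest index wins ties)."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; the returned centers are the visual vocabulary."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center when a cluster receives no points.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def bof_histogram(image, vocabulary):
    """Normalized count of how often each visual word occurs in the image."""
    descs = patch_descriptors(image)
    counts = [0] * len(vocabulary)
    for d in descs:
        counts[nearest(d, vocabulary)] += 1
    return [c / len(descs) for c in counts]

def classify(image, vocabulary, labelled_histograms):
    """Return the label whose training histogram is closest to the image's."""
    hist = bof_histogram(image, vocabulary)
    return min(labelled_histograms, key=lambda lh: sq_dist(hist, lh[1]))[0]

# Toy "word images": two textures standing in for two scripts.
horizontal = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
vertical = [[1, 0, 1, 0] for _ in range(4)]

vocab = kmeans(patch_descriptors(horizontal) + patch_descriptors(vertical), k=2)
training = [
    ("script_a", bof_histogram(horizontal, vocab)),
    ("script_b", bof_histogram(vertical, vocab)),
]
print(classify(horizontal, vocab, training))
```

In a realistic setting the patch descriptors would be replaced by a keypoint descriptor such as SURF, the vocabulary would contain hundreds of visual words, and the histograms would be passed to a trained classifier rather than compared directly.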

Files in This Item:

File	Description	Size	Format
ThesisFinal.pdf Restricted Access		2.54 MB	Adobe PDF
