Towards an end-to-end OCR system for Sinhala

Ranasinghe, R.M.

Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/21

Title:	Towards an end-to-end OCR system for Sinhala
Authors:	Ranasinghe, R.M.
Keywords:	OCR; Sinhala; Optical Character Reader; Research Subject Categories::TECHNOLOGY
Issue Date:	12-Sep-2013
Abstract:	One of the main objectives of using computers is to make data available and accessible, that is, to make it easy to reach the correct data in very little time. A search facility is essential to achieving this objective. In several industries, large amounts of data still exist only in printed form, generally known as hard copy, which does not support such functionality. Good examples are court decisions, legal notes and law-related materials from several countries. Most of this material still remains as hard copy (printed or typewritten documents). Access to it is very important to attorneys and to students in the field, yet finding a specific document in these formats may take several hours. Likewise, in the newspaper industry, many old newspaper articles have not yet been converted to soft copy, and in all areas there are still many books in printed form that have not been converted to soft copy. One common approach is to convert this data to soft format manually, which is time-consuming and very expensive. OCR technologies do a good job here and can convert most words (or letters) to soft format when the documents are fed in as images. Unfortunately, there is still no acceptable solution for Sinhala fonts, especially at industry level. The main objective of this implementation is to develop an industry-standard OCR system for Sinhala fonts. "Towards an end-to-end OCR system for Sinhala" aims to answer the main question of how to extract Sinhala text from any image format, and to identify a methodology that increases the accuracy of the conversion. The system was therefore developed using an open-source API, "Tesseract", which is a feature-rich and commonly used technology.
Visual Studio 2010 with Microsoft .NET Framework 4.0 was also used to build a user-friendly Windows application. During implementation, the system was found to depend heavily on the font size and font faces of the training (feeding) samples. When the training samples and the samples being read use the same font size and the same font faces, the application reaches an accuracy level above 98%. This includes numerals and some joint characters in Sinhala fonts, which were not considered in earlier research. To reduce errors further, the application incorporates a correct-word list from the UCSC language center (over 100,000 words filtered from four million words) and introduces the "Minimal Error Distance Toning" approach, which uses knowledge of the letters that OCR commonly mixes up. Using the "Minimal Error Distance" approach with the UCSC word list, more than 75% of the application's OCR-related errors were corrected. A further advantage of this approach is that it can fix most typographical errors in the source image.
URI:	http://hdl.handle.net/123456789/21
Appears in Collections:	Master of Computer Science - 2013
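
The dictionary-based correction step described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation, which is not available in this record. It assumes a weighted edit distance in which substituting two letters that OCR commonly confuses costs less than any other substitution, and it picks the nearest word from a word list. The names `CONFUSABLE`, `substitution_cost`, `weighted_distance` and `correct`, the cost values, the Latin placeholder letters and the three-word lexicon are all hypothetical; a real system would use Sinhala code points and the UCSC word list.

```python
from typing import Collection

# Hypothetical pairs of letters that an OCR engine tends to mix up.
# Latin placeholders are used here; Sinhala pairs would be listed the same way.
CONFUSABLE = {frozenset("il"), frozenset("ce"), frozenset("un")}


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with b (assumed values)."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return 0.5  # a likely OCR mix-up is cheaper than an arbitrary change
    return 1.0


def weighted_distance(source: str, target: str) -> float:
    """Edit distance with unit insert/delete cost and weighted substitution."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, a in enumerate(source, start=1):
        curr = [float(i)]
        for j, b in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # delete a
                curr[j - 1] + 1.0,                      # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def correct(word: str, lexicon: Collection[str], max_distance: float = 1.0) -> str:
    """Return the closest lexicon word, or the input if nothing is close enough."""
    if not lexicon or word in lexicon:
        return word
    # Ties on distance are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda w: (weighted_distance(word, w), w))
    return best if weighted_distance(word, best) <= max_distance else word


if __name__ == "__main__":
    words = ["line", "lime", "fine"]  # stand-in for a real word list
    print(correct("iine", words))
```

The `max_distance` threshold keeps the sketch from rewriting words that are far from every lexicon entry, such as proper nouns missing from the word list. Because the lookup works on whole words, the same step would also repair a typo in the source image when the misspelt word is close to a valid one.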

Files in This Item:

File	Description	Size	Format
Print_Chapters.pdf (Restricted Access)		1.73 MB	Adobe PDF
