Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/1808
Title: Tesseract Based Optical Character Recognizer for Sinhala Scripts
Authors: Dharmabandu, T.S.M.
Issue Date:  12
Abstract: Optical character recognition (OCR) is the process of converting a scanned document image into an editable document. OCR is a widely studied research area in the field of artificial intelligence. The use of OCR has led to the development of several open source and commercial OCR systems for Latin scripts. However, to date no perfect commercial or open source OCR system is available for the Sinhala language. Sinhala is the national language of Sri Lanka, and its script is related to the Brahmi scripts derived from India; it is round in shape and context sensitive in nature. Because of these properties, most Latin-based OCR technologies cannot be applied to Sinhala, which has its own way of writing characters, and most algorithms that work for Latin scripts may not work for Sinhala scripts. Among open source OCR systems, Tesseract is one of the most accurate; it was developed at HP Labs and later supported by Google. According to a recent study, Tesseract is able to recognize more than 20 Latin scripts and most Indian languages. This ability stems from Tesseract's own procedure for training on new languages, a key feature that enables it to recognize Indic script documents. Recognizing Sinhala script is a great challenge for OCR techniques, since Sinhala has one of the largest alphabets and different combinations of characters. The main contribution of this thesis is the development of the first Tesseract-based recognition system for the Sinhala language. The new approach has been evaluated using various document samples provided by the Language Technology Research Laboratory at UCSC, covering different font sizes and font types. The trained system uses several post-processing techniques to improve the accuracy of the output text. As a result, it has achieved accuracy of up to 97 percent for some Unicode-supported fonts, and it shows great potential for achieving good performance on real-world data.
URI: http://hdl.handle.net/123456789/1808
Appears in Collections: SCS Individual Project - Final Thesis (2012)

Files in This Item:
File: 38.pdf (Restricted Access)
Size: 1.13 MB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.