Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4703
Title: Exploring Model Level Transfer Learning For Improving Sinhala Speech Recognition
Authors: Nanayakkara, A.L.
Keywords: Recurrent Neural Network
Transfer Learning
Data Augmentation
Language Optimization
Speech recognition
Issue Date: 22-Jun-2023
Abstract: Automatic Speech Recognition (ASR) is the process of accurately translating spoken utterances into their corresponding textual form. ASR only transcribes the given speech into text; it does not address its semantics. Accurate ASR makes it easy to build interfaces for both literate and illiterate users. However, ASR performs well for widely used, data-rich languages such as English and German, but not for data-scarce languages such as Sinhala. Over the past few years, several studies have attempted to develop more accurate ASR for Sinhala, but the low-resource problem has limited their success. This project presents a study on building an ASR system for Sinhala, a low-resource and morphologically rich language. To tackle the data-scarcity issue, we use transfer learning, which transfers knowledge from a model trained on a data-rich language to one for a data-scarce language. We carried out several experiments on Sinhala speech recognition with DeepSpeech, considering aspects such as language optimizations, an external scorer, and data augmentation. We began with transfer learning from a pre-trained English model to Sinhala without any data augmentation, achieving a word error rate (WER) of 22.92% and a character error rate (CER) of 8.84%. Applying data augmentation to the transfer-learned model then reduced WER and CER substantially compared to the initial models: the model trained with 10% reverb and 30% overlay augmentation, with the remaining augmentation types at 40% each using their default values (as explained in this document), achieved a WER of 17.19% and a CER of 5.9%. Experiments were conducted on a Sinhala speech dataset obtained from the Language Technology Research Laboratory at UCSC.
The dataset comprises 40 hours of speech covering both male and female speakers, recorded with the support of the Praat and RedStart tools. All experiments used this dataset together with a 4-gram language model built with the KenLM toolkit. Finally, a user evaluation gave fairly good results for our model.
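The WER and CER figures reported above are standard edit-distance metrics: the Levenshtein distance between the reference and the hypothesis transcript, normalized by reference length, computed over words for WER and over characters for CER. A minimal sketch of how such metrics are computed (an illustration, not the evaluation script used in the study):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # distances for the empty-reference row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat", "the cat sit")` gives 1/3 (one substitution out of three reference words), which matches how a percentage such as 17.19% WER is obtained over an entire test set.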
URI: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4703
Appears in Collections: 2022

Files in This Item:
File: 2019 MCS 062.pdf (1.72 MB, Adobe PDF)


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.