Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4200
Title: Online Learning for Solving Data Availability Problem in Natural Language Processing
Authors: Kothalawala, B.W.
Issue Date: 26-Jul-2021
Abstract: Named Entity Recognition (NER) and Part of Speech (POS) tagging are major prerequisites for various NLP applications. State-of-the-art performance in NER and POS tagging is obtained using statistical and machine learning (ML) techniques. Machine learning models require a large data corpus to perform well. However, obtaining a large data resource at once is often not practical. In practice, data is acquired incrementally, often as a sequence of mini-batches. CRF, HMM, and MaxEnt models are the state-of-the-art ML models for NER and POS tagging. Since these models are based on batch learning techniques, they need to be retrained from scratch each time new data is added to the corpus. This study proposes to solve this problem using online machine learning techniques, because they can handle incrementally collected data sets. Two online learning models are constructed in this study: (1) Online Conditional Random Fields (CRF) and (2) Bidirectional Long Short Term Memory-Conditional Random Field (LSTM-CRF). In the Sinhala NER experiment, the proposed Online CRF model improved the F-measure from 31.4914% to 75.9259%, and the proposed Bidirectional LSTM-CRF model improved it from 51.6180% to 79.8365%. For Sinhala POS tagging, the proposed Online CRF model increased the accuracy from 70.7307% to 75.9147%, and the Bidirectional LSTM-CRF model increased it from 69.9681% to 76.0177%. The performance of the online learning models matched that of the batch learning models while retaining the flexibility of incremental model building. The training time of online learning remained nearly constant over the sequence of mini-batches, whereas the training time of the batch learning methods increased linearly.
URI: http://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4200
Appears in Collections: 2018

Files in This Item:
File: 2014CS067.pdf
Size: 925.87 kB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.