Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/3106
Title: An Ontology-based and Domain Specific Clustering Methodology for Financial Documents
Authors: Kulathunga, C.S.
Issue Date: 21-May-2015
Abstract: Financial documents play an important role in modern financial analysis and information retrieval tasks. In order to accomplish various investigational needs, financial organizations continuously search for accurate and meaningful unsupervised document classification techniques. Nevertheless, unsupervised document categorization, or document clustering, is a challenging problem studied by many scientists. Generally, document clustering can be divided into several phases, and the performance of those individual phases ultimately affects the final clustering results. In this study, we propose a novel clustering methodology for financial documents through enhancing the feature selection phase of the clustering procedure. Incorporating semantic knowledge from an ontology into document clustering has been extensively studied and has provided enhanced clustering performance. The incorporated semantic knowledge is generally used for identifying the semantic meanings of the words in the documents. Nevertheless, most of the proposed methodologies were tested on general document datasets, and most of the few available domain-specific clustering studies are restricted to specific domains such as medicine and engineering, where complete domain ontologies are available. Although the financial domain has several domain ontologies, none of them is complete and appropriate for semantic document clustering. In this context, our study proposes a document clustering methodology for financial documents which adapts the WordNet ontology to the financial domain to serve as an external knowledge source. This study empirically shows that nouns are relatively prevalent and more important for document clustering than other terms in a document. Afterwards, a subset of nouns is identified as most important for the clustering based on their frequency distribution within the main noun list. We develop a word sense disambiguation technique which uses ontological knowledge for noun disambiguation.
Finally, nouns in each document are disambiguated with the proposed word sense disambiguation technique, associated with tf-idf weights and clustered. On the basis of the empirical results of this research, it can be concluded that the proposed methodology can significantly enhance the clustering performance compared to no disambiguation and pure WordNet based disambiguation approaches.
URI: http://hdl.handle.net/123456789/3106
Appears in Collections: SCS Individual Project - Final Thesis (2014)

Files in This Item:
File: K.A.C.S Kulathunga - 10001662.pdf (Restricted Access)
Size: 1.15 MB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.