Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/3104
Title: Protein Function Prediction Using Machine Learning
Authors: Karunapala, E.D.S.C.
Issue Date: 21-May-2015
Abstract: Protein function prediction has gained significant importance, mainly because of its numerous applications in drug discovery. The drug discovery process starts with protein identification, because proteins are responsible for many of the functions required to maintain life, and protein identification in turn requires determining protein function. Because experimental characterization is much slower than the exponential growth in the number of available sequences, there is a significant need for computational methods of protein function prediction. Although machine learning has been applied in this context, most approaches are based on black-box models whose intermediate computational results are invisible. These intermediate steps may reveal significant biological facts. White-box models can therefore be crucial, yet they remain largely unexplored for function prediction. Even the few existing approaches use an extremely small number of sequences. Considering these issues identified in the literature, this study investigates the use of white-box machine learning techniques for protein function prediction. We use proteins represented by a set of enzymes, for which white-box approaches have not yet been explored. Expanding the feature pool, using significant features, and using a sufficient number of instances are other key concerns of this study. We therefore constructed a reliable data set of 3034 instances, considering only the reviewed information available in the Protein Data Bank (PDB). Each enzyme is represented by 23 Sequence Derived Features (SDFs), derived using different bioinformatics tools, and labelled according to the Enzyme Classification scheme. After applying a comprehensive feature engineering process to transform the constructed data set into a format compatible with the classifiers, we trained each classifier with its optimal parameter configuration and evaluated its performance. Finally, the best white-box model (C4.5) achieves an accuracy of 61% and the best black-box model (Random Forest) achieves an accuracy of 71%.
Keywords: Protein Function Prediction, Sequence Derived Features (SDFs), White-box models, Decision Trees, C4.5, Enzyme Classification
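Illustrative note: the abstract does not list the 23 SDFs or the tools used to compute them, so the following is only a minimal sketch of what a sequence-derived feature vector might look like. The function name `sequence_features` and the choice of features (sequence length, amino acid composition, and mean Kyte-Doolittle hydropathy) are assumptions made for illustration, not the feature set of the thesis.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale; using it here is an illustrative choice.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq):
    """Return a dict of simple features computed from a protein sequence.

    Hypothetical helper: the features are examples of sequence-derived
    features in general, not the 23 SDFs described in the abstract.
    """
    seq = seq.upper()
    if not seq or any(residue not in KD for residue in seq):
        raise ValueError("sequence must be non-empty and use the 20 standard residues")

    n = len(seq)
    counts = Counter(seq)

    features = {"length": float(n)}
    # Fraction of each amino acid in the sequence (composition).
    for aa in AMINO_ACIDS:
        features["frac_" + aa] = counts[aa] / n
    # Mean hydropathy over all residues.
    features["gravy"] = sum(KD[residue] for residue in seq) / n
    return features


if __name__ == "__main__":
    for name, value in sequence_features("MKVLAAGIVG").items():
        print(name, round(value, 3))
```

Each sequence then becomes one fixed-length row of numbers, which is the form that tree-based classifiers such as C4.5 or Random Forest expect as input.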
URI: http://hdl.handle.net/123456789/3104
Appears in Collections: SCS Individual Project - Final Thesis (2014)

Files in This Item:
File: 10000161 - EDSC Karunapala - FinalDissertation.pdf
Description: Restricted Access
Size: 1.53 MB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.