Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4430
Title: Predicting Gene-Disease Association Type in Biotechnology Literature using Text Mining
Authors: Rathnayake, R.M.U.A.
Issue Date: 4-Aug-2021
Abstract: Extracting gene-disease relations using text mining techniques and machine learning is a popular and important task in current research. Because a single person cannot read a large body of papers to draw conclusions about a given topic, text mining and machine learning models are of considerable help. Revealing gene-disease associations by a brute-force approach is more accurate, but it is lengthy, laborious and resource-intensive. Associations between gene and disease entities mentioned in biomedical literature can support the development of new drugs, diagnostic mechanisms and treatment mechanisms in Western medicine. This study addresses three research questions: how to extract gene-disease associations from biomedical literature, how to build a predictive model for gene-disease association using machine learning, and how to validate the predictive model. The aim is to predict the gene-disease association type (positive or negative) mentioned within sentences. The Genetic Association Database (GAD) and PubMed abstracts were used to train and validate the models. The models were trained and validated in three cycles. In all three cycles, Naive Bayes (NB), linear support vector machine (LSVM) and logistic regression (LR) classifiers were trained and evaluated. In the first cycle, model training and validation were based entirely on GAD; 70% of this data set was used to train the models. In the second cycle, PubMed abstracts were used to create a new test data set: 1000 sentences with gene and disease mentions were manually labeled according to relationship type. Next, the models trained during the first cycle were used to predict labels for the entire new data set. In the third cycle, the models were retrained on the entire GAD data set (100%) and model accuracy was evaluated again. We obtained good results in predicting gene-disease association types in biomedical literature. 
The maximum accuracies achieved by the NB, LSVM and LR classifiers were 0.876, 0.951 and 0.929, respectively. The predicted results were then visualized as a network graph showing the predicted relationships between genes and diseases. Moreover, a querying interface was provided so that the findings can be queried by supplying either a gene entity or a disease entity of interest. This allows any interested party to see the multiple associations predicted for a particular entity. Overall, the aim of the study is to derive a predictive model that identifies gene-disease relations in human-related biomedical literature using text mining and machine learning.
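The network graph and querying interface described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `build_graph` and `query`, the tuple layout, and the placeholder entity names are all assumptions made for illustration, and the sample predictions are invented rather than taken from the study.

```python
from collections import defaultdict

# Hypothetical predictions: (gene, disease, predicted association type).
# Placeholder names only; these are NOT results from the study.
PREDICTIONS = [
    ("GENE_A", "DISEASE_X", "positive"),
    ("GENE_A", "DISEASE_Y", "negative"),
    ("GENE_B", "DISEASE_X", "positive"),
]


def build_graph(predictions):
    """Index each predicted edge under both its gene node and its disease node."""
    graph = defaultdict(list)
    for gene, disease, label in predictions:
        graph[gene].append((disease, label))
        graph[disease].append((gene, label))
    return graph


def query(graph, entity):
    """Return all predicted associations for a gene or a disease, sorted by neighbour."""
    # .get() avoids creating an empty entry for unknown entities.
    return sorted(graph.get(entity, []))


graph = build_graph(PREDICTIONS)
print(query(graph, "GENE_A"))     # associations predicted for a gene
print(query(graph, "DISEASE_X"))  # associations predicted for a disease
```

Storing every edge under both endpoints lets the same lookup serve gene and disease queries, which matches the abstract's description of querying by either entity type.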
URI: http://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4430
Appears in Collections: 2019

Files in This Item:
File            Description  Size     Format
2016MCS094.pdf               1.35 MB  Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.