Towards a Statistical Model of Source Code

Pankajan, C.

Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/2451

Title:	Towards a Statistical Model of Source Code
Authors:	Pankajan, C.
Issue Date:	20-May-2014
Abstract:	Life now depends so heavily on programmable electronic devices such as personal computers and smartphones that programming has become a fundamental requirement in almost every field. While writing code is not necessarily difficult, writing high-quality code is. Syntax errors are easily identified by compilers and modern IDEs, but code quality is not simply a matter of syntax. Quality models such as ISO/IEC 9126-1 describe several different aspects of good-quality code. Code quality is important because of its impact on the robustness and maintainability of the resulting programs. In this thesis, we propose a novel statistical approach to the analysis of source code. This approach provides a means of measuring code quality that goes beyond traditional metrics such as Cyclomatic Complexity and Lines of Code. In recent years, the similarity between programming languages and natural languages has been explored in the academic literature, and approaches from natural language processing (NLP) have been successfully applied to programming languages. Document modelling, the idea of statistically modelling the structure and content of a document, is a very successful and widely applied technique for many tasks in NLP. Our goal in this work is to apply document modelling approaches to source code, and thereby capture implicit code patterns as a means of evaluating source code quality. One of the primary patterns in source code is the sequence in which methods or functions are called. Our approach builds a statistical model of such method transitions in a manner that allows us to successfully handle previously unseen methods. We do this by representing each method as a feature vector that encapsulates any available information about that method. This representation allows us to extrapolate the characteristics of unseen methods from those of known methods.
First, we evaluate whether statistical models can identify recurring patterns in source code by using our model to predict the methods in several test projects. For 50% of the test cases, it predicts the actual method within 4 attempts, compared with the 44 suggestions needed by the baseline method. Our approach is also able to identify methods with similar functionality via its simple feature representation. Finally, we show that, in real-world code, the probability scores computed by our method for the same code segment increase across code revisions; that is, since one may expect code segments to improve in quality over time, the probability scores computed by our method potentially correlate with code quality.
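As an illustration only, the method-transition idea described in the abstract can be sketched in Python. This is not the thesis's implementation: the count-based bigram model, the name-token features, the Jaccard similarity fallback for unseen methods, and every identifier below (`TransitionModel`, `name_features`, the example method names) are assumptions introduced for this sketch.

```python
import re
from collections import Counter, defaultdict


def name_features(method):
    """Hypothetical feature set: lowercase word tokens of a camelCase name."""
    return frozenset(t.lower() for t in re.findall(r"[A-Za-z][a-z]*", method))


def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class TransitionModel:
    """Counts how often one method call is directly followed by another."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1
        return self

    def _nearest_known(self, method):
        # Assumption: an unseen method behaves like the known method whose
        # name tokens overlap with it the most.
        feats = name_features(method)
        return max(sorted(self.counts),
                   key=lambda m: jaccard(feats, name_features(m)))

    def ranked_next(self, prev):
        """Candidate next methods, most frequent first."""
        if prev not in self.counts:
            prev = self._nearest_known(prev)
        return [m for m, _ in self.counts[prev].most_common()]

    def attempts_needed(self, prev, actual):
        """Rank of the actual next method, or None if it was never observed."""
        ranked = self.ranked_next(prev)
        return ranked.index(actual) + 1 if actual in ranked else None


# Toy call sequences, invented for this example.
sequences = [
    ["openFile", "readLine", "closeFile"],
    ["openFile", "readLine", "readLine", "closeFile"],
    ["openFile", "closeFile"],
]
model = TransitionModel().fit(sequences)

print(model.ranked_next("openFile"))
print(model.attempts_needed("openFile", "closeFile"))
print(model.ranked_next("openStream"))  # unseen method, falls back to openFile
```

In this sketch, `attempts_needed` plays the role of the "number of attempts" figure quoted in the abstract: it is the position of the actual next method in the ranked list of suggestions. The fallback for `openStream` shows, in a very reduced form, how a feature representation lets a model say something about a method it has never seen.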
URI:	http://hdl.handle.net/123456789/2451
Appears in Collections:	SCS Individual Project - Final Thesis (2013)

Files in This Item:

File	Size	Format
9001042.pdf (Restricted Access)	1.41 MB	Adobe PDF
