Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/434
Title: JavaScript Web Crawler
Authors: Silva, S.De.
Issue Date: 22-Oct-2013
Abstract: The World Wide Web is a rapidly growing and changing information source. Due to the dynamic nature of the Web, it becomes harder to find relevant and recent information. Web crawlers play a major role in collecting such information from the Web. A web crawler is a type of bot, or software agent. In general, it starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to be visited, recursively browsing the Web according to a set of policies. Web crawlers work as information collectors, capturing specific types of information such as site metadata, asset counts, broken links and email addresses (mainly for spam). They also work as sitemap generators, where the entire site is mapped into a single tree. It is quite possible to implement a simple functioning web crawler in a few lines of a high-level scripting language such as Perl, but achieving the same functionality with a client-side scripting language such as JavaScript is a challenge. Asynchronous JavaScript and XML (Ajax) is today's key technology driving the new generation of Web sites, termed Web 2.0. Ajax allows data retrieval in the background without interfering with the display and behavior of the Web application. Data is retrieved using XMLHttpRequest, an API that lets client-side JavaScript make HTTP connections to remote servers. This approach, however, does not allow cross-domain communication because of restrictions imposed by the browser: if you try to request data from a different domain, you get a security error. However great the technology is, one would normally choose a server-side language and continue the work rather than spend time on such language restrictions. But what is the fun of following the same traditional methods of programming in an industry full of new inventions? Most popular web crawlers that work as sitemap generators tend to produce a link report instead of the actual site structure: you specify a start URL, and the crawler follows all the links found in that HTML page and lists them under the root link. Not all links found in the root page are its children, so these tools end up producing a wrong sitemap. In this dissertation an attempt is made to present a new model and architecture for an effective web crawler, built as a Firefox extension that uses JavaScript as the implementation language, by breaking the same-origin policy barriers that constrain client-side programming languages. The dissertation also describes an algorithm that outputs a site structure depicting the actual parent-child relationships between pages, providing a more accurate sitemap.
URI: http://hdl.handle.net/123456789/434
Appears in Collections:Master of Computer Science - 2012
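The abstract above describes retrieving pages in the background with XMLHttpRequest and extracting their hyperlinks. The sketch below illustrates that step in JavaScript; it is a minimal illustration, not the dissertation's implementation, and it assumes a privileged context (such as a Firefox extension) in which cross-domain requests are permitted. The names fetchPage and onLinks are illustrative, not taken from the dissertation.

  // Fetch one page asynchronously and hand its hyperlinks to a callback.
  // Assumes a privileged context where cross-domain XMLHttpRequest is allowed.
  function fetchPage(url, onLinks) {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", url, true);                  // asynchronous request, as with Ajax
    xhr.onreadystatechange = function () {
      if (xhr.readyState !== 4) {
        return;                                  // response not complete yet
      }
      if (xhr.status !== 200) {
        onLinks([]);                             // report failures as pages with no links
        return;
      }
      // Parse the response and collect the raw href attribute of every anchor.
      // Resolving relative URLs against `url` is left out for brevity.
      var doc = new DOMParser().parseFromString(xhr.responseText, "text/html");
      var anchors = doc.getElementsByTagName("a");
      var links = [];
      for (var i = 0; i < anchors.length; i++) {
        var href = anchors[i].getAttribute("href");
        if (href) {
          links.push(href);
        }
      }
      onLinks(links);
    };
    xhr.send(null);
  }

Failed requests are reported as pages with no links so that a crawl loop built on top of this helper does not stall waiting for a callback that never arrives.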
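The abstract also outlines an algorithm that records the actual parent-child relationship between pages rather than a flat link report. A minimal sketch of that idea, assuming the illustrative fetchPage helper above and absolute, same-site links, is a breadth-first crawl in which a page's parent is the page on which its URL was first discovered:

  // Build a sitemap tree rooted at the start URL: each URL becomes a child of
  // the page on which it was first found, and every page appears only once.
  function buildSitemap(rootUrl, done) {
    var root = { url: rootUrl, children: [] };
    var visited = Object.create(null);    // URLs already queued
    visited[rootUrl] = true;
    var queue = [root];                   // breadth-first crawl frontier

    function crawlNext() {
      if (queue.length === 0) {
        done(root);                       // the whole tree has been built
        return;
      }
      var node = queue.shift();
      fetchPage(node.url, function (links) {
        links.forEach(function (link) {
          if (!visited[link]) {           // first discovery decides the parent
            visited[link] = true;
            var child = { url: link, children: [] };
            node.children.push(child);
            queue.push(child);
          }
        });
        crawlNext();                      // fetch one page at a time, for simplicity
      });
    }

    crawlNext();
  }

A call such as buildSitemap("http://example.com/", function (tree) { console.log(JSON.stringify(tree, null, 2)); }) would print the resulting parent-child tree; the URL here is a placeholder, and no depth or page limit is imposed in this sketch.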

Files in This Item:
File: 2008-mcs-008_dissertation.pdf
Description: Restricted Access
Size: 1.99 MB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.