Please use this identifier to cite or link to this item: https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/1795
Title: A Novel Approach to Focused Web Crawling Using Named Entity Recognition
Authors: Samarawickrama, S.
Issue Date:  12
Abstract: In recent years the World Wide Web (WWW) has grown to an extent that generic web crawlers can no longer keep up with it. As a result, focused web crawlers, which restrict themselves to a particular domain, have gained popularity. However, these crawlers are based on lexical terms and ignore the information contained in named entities, even though named entities can be very good evidence when crawling narrow domains. In this thesis we discuss a new approach to focused crawling that utilizes named entities. Initially, we conducted a pilot study to investigate the effect of named entities as features in web page classification. We conducted tests in five different domains (Baseball, Football, Health, Politics and Science) with web pages collected manually from online news magazines. For the two narrow domains, Baseball and Football, we achieved F1 scores of 94.2% and 93.4% respectively, an improvement of 2.8% and 2.9% respectively over lexical terms as features. Following these promising results, we conducted experiments in focused web crawling in three narrow domains: Baseball, Football and American Politics. Our results showed that, at any time during the crawl, the collection built with our crawler was better than that of the traditional focused crawler based on lexical terms in terms of harvest ratio. This was true for all three domains considered.
URI: http://hdl.handle.net/123456789/1795
Appears in Collections: SCS Individual Project - Final Thesis (2012)

Files in This Item:
File: 26.pdf (Restricted Access)
Size: 1.97 MB
Format: Adobe PDF


Items in UCSC Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.