Distributed web crawlers using Hadoop
Journal
International Journal of Applied Engineering Research
Date Issued
2017
Abstract
A web crawler is software that crawls the WWW to build a database for a search engine. In recent years, web crawling has faced several challenges. First, web pages are highly unstructured, which makes it difficult to maintain a generic schema for storage. Second, the WWW is too large to be indexed in its entirety. Finally, the most difficult challenge is crawling the deep web. Here we propose a novel web crawler that uses Neo4j and HBase as data stores. It also applies Natural Language Processing (NLP) and machine learning techniques to address the above-mentioned problems. © Research India Publications.
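The abstract does not describe the storage layout in detail, but a minimal sketch of how a crawled page might be persisted across the two stores is given below: raw, unstructured page content keyed by URL in HBase, and the link structure as a graph in Neo4j. The table name "pages", the "content" column family, the "Page" label, the "LINKS_TO" relationship, and the connection settings are illustrative assumptions, not details taken from the paper.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;
import static org.neo4j.driver.v1.Values.parameters;

/** Illustrative persistence step: page content goes to HBase, link structure to Neo4j. */
public class PageStore implements AutoCloseable {

    private final Connection hbase;
    private final Table pages;   // assumed HBase table 'pages' with column family 'content'
    private final Driver neo4j;

    public PageStore() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        this.hbase = ConnectionFactory.createConnection(conf);
        this.pages = hbase.getTable(TableName.valueOf("pages"));
        this.neo4j = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));   // placeholder credentials
    }

    /** Store the fetched HTML keyed by URL and record each outgoing link as a graph edge. */
    public void save(String url, String html, List<String> outLinks) throws Exception {
        // Unstructured page body: one HBase row per URL, no fixed schema required.
        Put put = new Put(Bytes.toBytes(url));
        put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes(html));
        pages.put(put);

        // Link graph: Page nodes and LINKS_TO relationships in Neo4j.
        try (Session session = neo4j.session()) {
            for (String target : outLinks) {
                session.run("MERGE (p:Page {url: $src}) " +
                            "MERGE (q:Page {url: $dst}) " +
                            "MERGE (p)-[:LINKS_TO]->(q)",
                        parameters("src", url, "dst", target));
            }
        }
    }

    @Override
    public void close() throws Exception {
        pages.close();
        hbase.close();
        neo4j.close();
    }
}

Splitting the data this way matches the two stores named in the abstract: HBase tolerates the lack of a generic schema for page content, while Neo4j keeps the link graph queryable for ranking or frontier selection; the exact division of responsibilities in the published system may differ.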