This document lists several web crawlers and their features: the name of each crawler, its main features, the programming language it is written in, a description, and a URL or rank. The crawlers covered include Heritrix, crawler4j, WebSPHINX, Mozenda, VietSpider, Scrapy, and JSpider.
2. Sheet1
description language
Heritrix is the Internet Archive's open-source, extensible,
web-scale, archival-quality web crawler project. java
Crawler4j is an open-source Java crawler that provides a
simple interface for crawling the Web. You can set up a
multi-threaded web crawler… java
WebSPHINX (Website-Specific Processors for HTML
INformation eXtraction) is a Java class library and interactive
development environment for web crawlers. A web crawler (also
called a robot or spider) is a program that browses and processes
Web pages automatically. java
tspider is a complete Web Data Extraction and automation
suite. It has a simple wizard-driven interface for common
tasks, but has much more advanced functionality than our
competitors. A solution for exploiting, collecting, and
categorizing data from the internet for specific purposes.
Scrapy is a fast, high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data
from their pages. It can be used for a wide range of purposes,
from data mining to monitoring and automated testing. python
java
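The WebSPHINX entry above describes a web crawler as a program that browses and processes Web pages automatically. As a rough illustration only, that loop might be sketched in Python as follows. This is not code from any of the crawlers listed here; the names crawl, extract_links, LinkExtractor, and FAKE_SITE, and the example.test URLs, are assumptions made up for this sketch, and fetch is a caller-supplied function so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (fragments removed) linked from the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at seed.

    fetch(url) is assumed to return the page HTML as a string,
    or None when the page cannot be retrieved.
    """
    queue = deque([seed])
    seen = {seed}      # URLs already queued, so no page is fetched twice
    visited = []       # URLs successfully fetched, in crawl order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site" standing in for real HTTP requests.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/a#top">A again</a>',
}

if __name__ == "__main__":
    for page in crawl("http://example.test/", FAKE_SITE.get):
        print(page)
```

Passing fetch in as a parameter keeps the traversal logic separate from the network layer. A real crawler would replace FAKE_SITE.get with an HTTP client and would also need politeness rules such as robots.txt handling and rate limiting, which this sketch leaves out.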