This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.
The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any further crawls those requests trigger, whether through frontier expansion or depth traversal, are also distributed among all workers in the cluster.
Both the input to the system and its output are sets of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.
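The distribution idea described above can be illustrated with a minimal sketch. It is not the project's implementation: `SharedQueue` is a hypothetical in-memory stand-in for a Redis-backed queue, `run_cluster` simulates the workers sequentially in round-robin order rather than as concurrent processes, and the duplicate filter is an assumption added so that cyclic links terminate.

```python
from collections import deque


class SharedQueue(object):
    """Hypothetical in-memory stand-in for a queue held in Redis."""

    def __init__(self):
        self._items = deque()
        self._seen = set()

    def push(self, url):
        # Assumption: skip URLs that were already queued, so a page
        # that links back to its parent does not loop forever.
        if url not in self._seen:
            self._seen.add(url)
            self._items.append(url)

    def pop(self):
        # Return the next URL, or None when the queue is empty.
        return self._items.popleft() if self._items else None


def run_cluster(seeds, links, worker_count):
    """Simulate workers draining one shared queue in round-robin order.

    `links` maps a URL to the URLs found on that page (made-up data).
    Returns a dict of worker id -> list of URLs that worker crawled.
    """
    queue = SharedQueue()
    for url in seeds:
        queue.push(url)

    crawled = {worker: [] for worker in range(worker_count)}
    worker = 0
    while True:
        url = queue.pop()
        if url is None:
            break
        crawled[worker].append(url)
        # Newly discovered links go back into the shared queue,
        # so any worker may pick them up, not only the one that found them.
        for found in links.get(url, []):
            queue.push(found)
        worker = (worker + 1) % worker_count
    return crawled
```

Calling `run_cluster(["http://a.example"], {"http://a.example": ["http://a.example/1", "http://a.example/2"]}, 2)` spreads the seed and the two links it yields across both simulated workers. In a real deployment the queue would live in Redis so that separate spider processes on different machines could share it.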
Please see the requirements.txt within each sub-project for pip package dependencies.
Other important components required to run the cluster:
- Python 2.7: https://www.python.org/downloads/
- Redis: http://redis.io
- Zookeeper: https://zookeeper.apache.org
- Kafka: http://kafka.apache.org
This project tries to bring together a bunch of new concepts to Scrapy and large scale distributed crawling in general. Some bullet points include: