Tags: python, scrapy, twisted, scrapyd

Parallelism/performance problems with Scrapyd and a single spider


Context

I am running scrapyd 1.1 + scrapy 0.24.6 with a single "selenium-scrapy hybrid" spider that crawls over many domains according to parameters. The development machine that hosts the scrapyd instance(s?) is an OS X Yosemite box with 4 cores, and this is my current configuration:

[scrapyd]
max_proc_per_cpu = 75
debug = on

Output when scrapyd starts:

2015-06-05 13:38:10-0500 [-] Log opened.
2015-06-05 13:38:10-0500 [-] twistd 15.0.0 (/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python 2.7.9) starting up.
2015-06-05 13:38:10-0500 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2015-06-05 13:38:10-0500 [-] Site starting on 6800
2015-06-05 13:38:10-0500 [-] Starting factory <twisted.web.server.Site instance at 0x104b91f38>
2015-06-05 13:38:10-0500 [Launcher] Scrapyd 1.0.1 started: max_proc=300, runner='scrapyd.runner'

EDIT:

Number of cores:

python -c 'import multiprocessing; print(multiprocessing.cpu_count())' 
4
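
For reference, when max_proc is not set explicitly, scrapyd derives the process limit from max_proc_per_cpu multiplied by the number of CPUs, which matches the max_proc=300 reported in the startup log above (75 × 4). A minimal Python sketch of that arithmetic (an assumption about scrapyd's internals based on its documented options, not something stated in the post):

import multiprocessing

# With max_proc left at its default of 0, scrapyd computes the limit as
# max_proc_per_cpu * cpu_count, consistent with the startup log above.
max_proc_per_cpu = 75
max_proc = max_proc_per_cpu * multiprocessing.cpu_count()
print(max_proc)  # 300 on this 4-core machine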

Problem

I would like a setup that processes up to 300 jobs simultaneously for a single spider, but scrapyd is processing only 1 to 4 jobs at a time regardless of how many are pending:

(Screenshot: Scrapyd console with pending and running jobs)
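
For context, each job is queued through scrapyd's schedule.json endpoint, one HTTP request per job. A minimal sketch of such a scheduling loop is shown below; the project name, spider name and the per-job "domain" argument are placeholders, not the actual parameters used:

import requests

# Queue one scrapyd job per domain via the schedule.json API.
# "myproject", "myspider" and the domain list are hypothetical.
for domain in ["example.com", "example.org", "example.net"]:
    response = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "myproject", "spider": "myspider", "domain": domain},
    )
    print(response.json())  # e.g. {"status": "ok", "jobid": "..."}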

EDIT:

CPU usage is not overwhelming:

(Screenshot: CPU usage on OS X)

TESTED ON UBUNTU

I have also tested this scenario on an Ubuntu 14.04 VM, and the results are more or less the same: at most 5 jobs ran at once during execution, CPU consumption was not overwhelming, and roughly the same amount of time was taken to execute the same number of tasks.


Solution

  • My problem was that my jobs lasted less time than the default poll_interval value of 5 seconds, so not enough new tasks were polled before the previous ones finished. Changing this setting to a value lower than the average duration of a crawl job helps scrapyd poll more jobs for execution; see the configuration sketch below.
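
A sketch of the adjusted configuration, assuming average job durations well under 5 seconds; the 0.5 value is illustrative and should be tuned against the real average job duration:

[scrapyd]
max_proc_per_cpu = 75
poll_interval = 0.5
debug = on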