I am running scrapyd 1.1 + scrapy 0.24.6 with a single "selenium-scrapy hybrid" spider that crawls many domains according to the parameters it receives. The development machine that hosts the scrapyd instance is an OS X Yosemite box with 4 cores, and this is my current configuration:
[scrapyd]
max_proc_per_cpu = 75
debug = on
Output when scrapyd starts:
2015-06-05 13:38:10-0500 [-] Log opened.
2015-06-05 13:38:10-0500 [-] twistd 15.0.0 (/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python 2.7.9) starting up.
2015-06-05 13:38:10-0500 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2015-06-05 13:38:10-0500 [-] Site starting on 6800
2015-06-05 13:38:10-0500 [-] Starting factory <twisted.web.server.Site instance at 0x104b91f38>
2015-06-05 13:38:10-0500 [Launcher] Scrapyd 1.0.1 started: max_proc=300, runner='scrapyd.runner'
Number of cores:
python -c 'import multiprocessing; print(multiprocessing.cpu_count())'
4
I would like this setup to process up to 300 jobs simultaneously for a single spider, but scrapyd only runs 1 to 4 at a time, regardless of how many jobs are pending.
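For reference, the jobs are queued through scrapyd's schedule.json endpoint, roughly like this (the project name, spider name and domain argument are placeholders for my actual parameters):

curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider -d domain=example.com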
CPU usage is not overwhelming.
I have also tested this scenario on an Ubuntu 14.04 VM with much the same result: at most 5 jobs were running at any point during execution, CPU consumption was not overwhelming, and roughly the same time was needed to execute the same number of tasks.
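If you want to verify the pending/running/finished counts yourself, scrapyd's listjobs.json endpoint reports them (again, the project name is a placeholder):

curl http://localhost:6800/listjobs.json?project=myproject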
My problem was that my jobs lasted less than the poll_interval default value of 5 seconds, so not enough new tasks were polled before the previous ones finished. Changing this setting to a value lower than the average duration of a crawl job lets scrapyd poll more jobs for execution.
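For example, with jobs that typically finish in a couple of seconds, a configuration along these lines keeps the queue fed (the exact poll_interval value is my own guess, so tune it to your job durations; as far as I can tell the setting is read as a float, so sub-second values are also accepted):

[scrapyd]
max_proc_per_cpu = 75
debug = on
poll_interval = 1.0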