Tags: python, python-3.x, scrapy, twisted

Throttle Requests in Scrapy


I am developing a spider with Scrapy that iterates through a keyed URL. It uses a URL template (e.g. https://google.com/{key}) and substitutes each key in turn. The problem is that I cannot get it to stop iterating through those URLs at the right time. For example, once I start receiving enough failed requests (such as 404s), I would like to terminate so I am not sending more requests than needed.
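
A minimal sketch of the kind of spider I mean (the class name, URL template, and key range are just illustrative, not my actual code):

    import scrapy


    class KeyedSpider(scrapy.Spider):
        # Illustrative spider: iterates a templated URL over sequential keys.
        name = "keyed"

        def start_requests(self):
            # The template and key range are placeholders.
            for key in range(1, 100001):
                yield scrapy.Request(f"https://example.com/items/{key}", callback=self.parse)

        def parse(self, response):
            # Normal item extraction would happen here.
            yield {"url": response.url}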

I attempted to raise CloseSpider(). This partially works: it will stop the spider, but not before some requests have already gone through.

I then attempted to just keep yielding the requests while keeping track of how many had executed or failed. The problem is that I don't think Scrapy can run asynchronously from start_requests.
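
Roughly, that attempt looked like the sketch below (the counter, threshold, and errback names are mine): the start_requests generator checks a failure counter before yielding the next request, and the errback bumps it when a request fails.

    import scrapy


    class KeyedSpider(scrapy.Spider):
        name = "keyed"
        max_failures = 50  # illustrative threshold

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.failed = 0

        def start_requests(self):
            for key in range(1, 100001):
                if self.failed >= self.max_failures:
                    return  # stop generating new requests
                yield scrapy.Request(
                    f"https://example.com/items/{key}",
                    callback=self.parse,
                    errback=self.on_error,
                )

        def parse(self, response):
            yield {"url": response.url}

        def on_error(self, failure):
            # Non-2xx responses such as 404s surface here as HttpError
            # failures when an errback is attached.
            self.failed += 1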

I really need one of two solutions:

1) A way to dynamically yield requests from start_requests (from another article, this doesn't seem possible). That way I could keep track of the current errors and stop yielding requests once I know I have hit a certain error threshold.

2) A way to allow the already downloaded pages to finish processing through their callbacks and pipelines when a CloseSpider exception is thrown. That way any non-404 hits actually finish.


Solution

  • I figured this out. Since I am traversing the keys in order and expecting one to eventually not exist, I need to configure Scrapy to work in FIFO order instead of the default LIFO order in settings.py:

        DEPTH_PRIORITY = 1
        SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
        SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
    

    I also ensured that the depth-2 and depth-3 requests had a higher priority than the start requests. Then, by keeping track of 404s, I was able to raise the CloseSpider exception with all of the expected results completed. A rough sketch of the spider side follows below.
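
    Roughly, the spider side ended up like this (the URL template, key range, 404 threshold, and priority value are illustrative, not my exact code): 404 responses are let through to the callback, counted, and CloseSpider is raised once the threshold is crossed, while the follow-up requests get an explicit higher priority than the start requests.

        import scrapy
        from scrapy.exceptions import CloseSpider


        class KeyedSpider(scrapy.Spider):
            name = "keyed"
            handle_httpstatus_list = [404]  # let 404 responses reach the callback
            max_404 = 20  # illustrative threshold

            def __init__(self, *args, **kwargs):
                super().__init__(*args, **kwargs)
                self.seen_404 = 0

            def start_requests(self):
                # Start requests keep the default priority (0).
                for key in range(1, 100001):
                    yield scrapy.Request(f"https://example.com/items/{key}", callback=self.parse)

            def parse(self, response):
                if response.status == 404:
                    self.seen_404 += 1
                    if self.seen_404 >= self.max_404:
                        # With FIFO scheduling, the earlier keys have already
                        # been requested by the time the threshold trips.
                        raise CloseSpider("reached 404 threshold")
                    return
                # Depth-2/3 requests get a higher priority than the start requests.
                for href in response.css("a::attr(href)").getall():
                    yield response.follow(href, callback=self.parse_detail, priority=10)

            def parse_detail(self, response):
                yield {"url": response.url}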