Tags: scrapy, scrapyd, scrapy-splash

scrapy_splash.SplashRequest doesn't execute callback function when scheduled by scrapyd


I encountered some strange behaviour (from my level of knowledge) of SplashRequest's callback when the spider is executed by scrapyd.

Spider source code

from scrapy.spiders import Spider  # Spider lives in scrapy.spiders, not scrapy.spiders.Spider
from scrapy_splash import SplashRequest


class SiteSaveSpider(Spider):
    name = "sitesavespider"

    def __init__(self, domain='', *args, **kwargs):
        super(SiteSaveSpider, self).__init__(*args, **kwargs)
        self.start_urls = [domain]
        self.allowed_domains = [domain]

    def start_requests(self):
        for url in self.start_urls:
            # Render the page through Splash, then hand the result to parse()
            yield SplashRequest(url, callback=self.parse, args={'wait': 0.5})
            print "TEST after yield"

    def parse(self, response):
        print "TEST in parse"
        with open('/some_path/test.html', 'w') as f:
            f.write(response.body)
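For context, the scrapy-splash wiring in settings.py is not shown above; the entries below are the standard ones from the scrapy-splash README, sketched here assuming Splash runs on its default port 8050, as the logs below suggest:

# settings.py -- scrapy-splash wiring as documented in its README
SPLASH_URL = 'http://127.0.0.1:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'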

Log for the local Scrapy crawler

The parse callback is executed when the spider is started with

scrapy crawl sitesavespider -a domain="https://www.facebook.com"
...
2017-01-29 14:12:37 [scrapy.core.engine] INFO: Spider opened
2017-01-29 14:12:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
TEST after yield
2017-01-29 14:12:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.facebook.com via http://127.0.0.1:8050/render.html> (referer: None)
TEST in parse
2017-01-29 14:12:55 [scrapy.core.engine] INFO: Closing spider (finished)
...

Log for scrapyd

When the same spider is started via scrapyd, it returns directly after the SplashRequest, without ever calling parse:

>>> scrapyd.schedule("feedbot", "sitesavespider", domain="https://www.facebook.com")
u'f2f4e090e62d11e69da1342387f8a0c9'
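The schedule call above is made through a scrapyd API client; for reference, the equivalent raw request against scrapyd's documented schedule.json endpoint would be roughly the following sketch, assuming scrapyd listens on its default port 6800:

import requests

# Same scheduling request, sent directly to scrapyd's schedule.json
# endpoint; extra form fields such as "domain" are passed through to
# the spider as arguments.
resp = requests.post('http://localhost:6800/schedule.json', data={
    'project': 'feedbot',
    'spider': 'sitesavespider',
    'domain': 'https://www.facebook.com',
})
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}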

cat f2f4e090e62d11e69da1342387f8a0c9.log
... 
2017-01-29 14:19:34 [scrapy.core.engine] INFO: Spider opened
2017-01-29 14:19:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-29 14:19:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.facebook.com via http://127.0.0.1:8050/render.html> (referer: None)
2017-01-29 14:19:58 [scrapy.core.engine] INFO: Closing spider (finished)
...

Is anybody aware of this problem, or can someone help me find my mistake?


Solution

  • After trying to reproduce the problem on another computer, it no longer occurred, so I cannot prove it. For others debugging similar issues:

    • print calls in your own spider are not written by scrapyd into the job's log file by default; they go to the terminal session where scrapyd was started (a logging-based alternative is sketched after the excerpt below):

    2017-02-21 16:24:29+0100 [HTTPChannel,0,127.0.0.1] 127.0.0.1 - - [21/Feb/2017:15:24:28 +0000] "GET /listjobs.json?project=feedbot HTTP/1.1" 200 199 "-" "python-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-86-generic"
    2017-02-21 16:24:29+0100 [Launcher,17915/stdout] TEST after yield
    TEST in parse
    2017-02-21 16:24:29+0100 [HTTPChannel,0,127.0.0.1] 127.0.0.1 - - [21/Feb/2017:15:24:28 +0000] "GET /listjobs.json?project=feedbot HTTP/1.1" 200 199 "-" "python-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-86-generic"
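To make such debug output land in the scrapyd job log instead of scrapyd's terminal, use the spider's built-in logger in place of print. A minimal sketch of the parse callback rewritten that way (self.logger is standard on Scrapy spiders):

def parse(self, response):
    # self.logger goes through Scrapy's logging system, so the message
    # ends up in the per-job log file that scrapyd writes, not on stdout.
    self.logger.info('TEST in parse: %d bytes from %s',
                     len(response.body), response.url)
    with open('/some_path/test.html', 'w') as f:
        f.write(response.body)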