I did encounter some strange behaviour (to my perspective of knowledge) of SplashRequest's callback when it's executed by scrapyd.
from scrapy.spiders.Spider import Spider
from scrapy import Request
import scrapy
from scrapy_splash import SplashRequest
class SiteSaveSpider(Spider):
def __init__(self, domain='', *args, **kwargs):
super(SiteSaveSpider, self).__init__(*args, **kwargs)
self.start_urls = [domain]
self.allowed_domains = [domain]
name = "sitesavespider"
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url, callback=self.parse, args={'wait':0.5})
print "TEST after yield"
def parse(self, response):
print "TEST in parse"
with open('/some_path/test.html', 'w') as f:
for line in response.body:
f.write(line)
The callback parse function is executed when started by
scrapy crawl sitesavespider -a domain="https://www.facebook.com"
...
2017-01-29 14:12:37 [scrapy.core.engine] INFO: Spider opened
2017-01-29 14:12:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
TEST after yield
2017-01-29 14:12:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.facebook.com via http://127.0.0.1:8050/render.html> (referer: None)
TEST in parse
2017-01-29 14:12:55 [scrapy.core.engine] INFO: Closing spider (finished)
...
When starting the same spider by using scrapyd it will return directly after the SplashRequest:
>>>scrapyd.schedule("feedbot","sitesavespider",domain="https://www.facebook.com")
u'f2f4e090e62d11e69da1342387f8a0c9'
cat f2f4e090e62d11e69da1342387f8a0c9.log
...
2017-01-29 14:19:34 [scrapy.core.engine] INFO: Spider opened
2017-01-29 14:19:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-29 14:19:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.facebook.com via http://127.0.0.1:8050/render.html> (referer: None)
2017-01-29 14:19:58 [scrapy.core.engine] INFO: Closing spider (finished)
...
Is somebody aware of this problem or could help me to find my mistakes?
After trying to reconstruct the problem on another computer, it wasn't there anymore and I cannot proof it. For others, trying to debug such issues:
2017-02-21 16:24:29+0100 [HTTPChannel,0,127.0.0.1] 127.0.0.1 - - [21/Feb/2017:15:24:28 +0000] "GET /listjobs.json?project=feedbot HTTP/1.1" 200 199 "-" "python-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-86-generic"
2017-02-21 16:24:29+0100 [Launcher,17915/stdout] TEST after yield
TEST in parse
2017-02-21 16:24:29+0100 [HTTPChannel,0,127.0.0.1] 127.0.0.1 - - [21/Feb/2017:15:24:28 +0000] "GET /listjobs.json?project=feedbot HTTP/1.1" 200 199 "-" "python-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-86-generic"