
Providing url for spider using scrapyd api


I tried something like:

payload = {"project": settings['BOT_NAME'],
             "spider": crawler_name,
             "start_urls": ["http://www.foo.com"]}
response = requests.post("http://192.168.1.41:6800/schedule.json",
                           data=payload)

And when I check the logs, I see this traceback:

File "/usr/lib/pymodules/python2.7/scrapy/spider.py", line 53, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 26, in __init__
    self._set_url(url)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h

Looks like only the first letter of "http://www.foo.com" is used as request.url, and I really have no idea why.
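For context, here is the likely cause (an assumption based on how scrapyd passes spider arguments, which arrive as plain strings): `start_urls` ends up being the string `"http://www.foo.com"` rather than a list, and Scrapy's default `start_requests` iterates over `start_urls`. Iterating a string yields single characters, so the first "url" it builds a `Request` from is just `"h"`. A minimal sketch:

```python
# Sketch of why "Missing scheme in request url: h" appears when
# start_urls ends up being a string instead of a list.
start_urls = "http://www.foo.com"  # what the spider actually receives

# Scrapy's default start_requests() effectively does:
#   for url in self.start_urls: yield Request(url)
first = next(iter(start_urls))
print(first)  # iterating a string yields characters, so the first "url" is "h"
```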

Update

Maybe start_urls should be a string instead of a list containing 1 element, so I also tried:

"start_urls": "http://www.foo.com"

and

"start_urls": [["http://www.foo.com"]]

only to get the same error.
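That is consistent with the explanation above: `requests` form-encodes a `data=` payload, and a one-element list collapses to a single `key=value` pair, so scrapyd receives a plain string either way. A quick sketch of the encoding (the `"myspider"` name here is illustrative):

```python
from urllib.parse import urlencode

# requests form-encodes `data=` payloads; a one-element list collapses
# to a single key=value pair, so scrapyd receives a plain string either way.
payload = {"spider": "myspider", "start_urls": ["http://www.foo.com"]}
encoded = urlencode(payload, doseq=True)
print(encoded)  # the start_urls field is just the URL string, not a JSON list
```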


Solution

  • You could modify your spider to receive a url argument and set start_urls from it on init.

    from scrapy import Spider


    class MySpider(Spider):

        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            # use an instance attribute: a class-level list would be
            # shared (and mutated) across every MySpider instance
            self.start_urls = [kwargs.get('url')]

        def parse(self, response):
            # do stuff
            pass


    Your payload will now be:

    payload = {
        "project": settings['BOT_NAME'],
        "spider": crawler_name,
        "url": "http://www.foo.com"
    }
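
Posting that payload and checking scrapyd's reply might look like the sketch below. The `status`/`jobid` keys are what `schedule.json` normally returns; the `jobid` value here is simulated for illustration rather than taken from a real call:

```python
import json

# Simulated schedule.json reply; a real call would be:
#   response = requests.post("http://192.168.1.41:6800/schedule.json", data=payload)
#   reply = response.json()
reply = json.loads('{"status": "ok", "jobid": "not-a-real-jobid"}')

if reply["status"] == "ok":
    job_id = reply["jobid"]  # keep this to poll listjobs.json later
else:
    raise RuntimeError("scheduling failed: %s" % reply)
```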