Tags: python-3.x, scrapy, google-cloud-platform, google-cloud-storage, scrapinghub

Scrapy, Scrapinghub and Google Cloud Storage: KeyError: 'gs' while running the spider on Scrapinghub


I'm working on a Scrapy project using Python 3, and the spiders are deployed to Scrapinghub. I'm also using Google Cloud Storage to store the scraped files, as described in the official documentation here.
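
The relevant part of the configuration looks roughly like this (a minimal sketch; the bucket name is a placeholder, and the ImagesPipeline registration matches the traceback further down):

    # settings.py (sketch)
    ITEM_PIPELINES = {
        'scrapy.pipelines.images.ImagesPipeline': 1,
    }
    IMAGES_STORE = 'gs://<bucket-name>/'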

The spiders run absolutely fine when I run them locally, and they deploy to Scrapinghub without any errors. I'm using scrapy:1.4-py3 as the stack on Scrapinghub. When the spiders run there, I get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/media.py", line 68, in from_crawler
    pipe = cls.from_settings(crawler.settings)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/images.py", line 95, in from_settings
    return cls(store_uri, settings=settings)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/images.py", line 52, in __init__
    download_func=download_func)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 234, in __init__
    self.store = self._get_store(store_uri)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 269, in _get_store
    store_cls = self.STORE_SCHEMES[scheme]
KeyError: 'gs'

PS: 'gs' is the scheme used in the storage path for the files, e.g.:

'IMAGES_STORE':'gs://<bucket-name>/'
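
From the last frame of the traceback, the failure happens where Scrapy maps the scheme of this URI to a storage backend. A simplified sketch of that lookup (not the actual Scrapy source; the mapping below is only an illustrative subset):

    from urllib.parse import urlparse

    # The storage URI's scheme is used as a key into a STORE_SCHEMES mapping.
    STORE_SCHEMES = {
        '': 'FSFilesStore',      # local filesystem (default)
        'file': 'FSFilesStore',
        's3': 'S3FilesStore',
        # no 'gs' entry registered in Scrapy 1.4
    }

    scheme = urlparse('gs://<bucket-name>/').scheme  # 'gs'
    store_cls = STORE_SCHEMES[scheme]                # raises KeyError: 'gs'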

I have researched this error but couldn't find any solution. Any help would be greatly appreciated.


Solution

  • Google Cloud Storage support is a new feature in Scrapy 1.5, so you need to use the scrapy:1.5-py3 stack in Scrapy Cloud.
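
Once the project runs on a Scrapy 1.5 stack, the GCS backend also expects a project ID and the google-cloud-storage client library to be available in the Scrapy Cloud image (e.g. via the project's requirements file). A minimal sketch of the settings, with placeholder values:

    # settings.py (sketch for Scrapy 1.5; values are placeholders)
    IMAGES_STORE = 'gs://<bucket-name>/'
    GCS_PROJECT_ID = '<your-gcp-project-id>'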