Tags: python, json, scrapy, response, hal

scrapy hal+json unsupported response type


I'm trying to scrape a URL that serves HAL+JSON (according to Firefox and Safari), and Scrapy returns a response object that it doesn't recognise.

The link is https://catalogue.presto.com.au/. It opens fine in Chrome, which shows the JSON in the browser, but Firefox and Safari download it as a file instead. I suspect that Scrapy also treats the link as a file download, so it does not scrape it.

Has anyone encountered something similar, or does anyone have a solution?

Accessing via Shell

When I open the site from the terminal with "scrapy shell https://catalogue.presto.com.au", the page is crawled successfully:

"2015-03-15 00:15:08+0700 [default] DEBUG: Crawled (200) <GET https://catalogue.presto.com.au>"

I then try to view(response) and get this error:

>>> view(response)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/scrapy/utils/response.py", line 86, in open_in_browser
    response.__class__.__name__)
TypeError: Unsupported response type: Response

Running in a spider:

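# Requires at module top: from scrapy.utils.response import open_in_browser
def parse(self, response):
    print(response.__class__)  # shows which Response subclass Scrapy chose
    open_in_browser(response)  # raises TypeError for a plain Response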


2015-03-15 00:23:05+0700 [prestotv2] DEBUG: Crawled (200) <GET https://catalogue.presto.com.au/> (referer: None)
<class 'scrapy.http.response.Response'>  # this line is the output of the print in parse()

2015-03-15 00:23:05+0700 [prestotv2] ERROR: Spider error processing <GET https://catalogue.presto.com.au/>
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1201, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 382, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 490, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/Users/nathansu/Documents/Development/Whutstream/scraping/Presto/presto/spiders/TvSpider.py", line 38, in parse
        open_in_browser(response)
      File "/Library/Python/2.7/site-packages/scrapy/utils/response.py", line 86, in open_in_browser
        response.__class__.__name__)
    exceptions.TypeError: Unsupported response type: Response

Solution

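  • This happens because the response Content-Type is application/hal+json. Scrapy does not treat that type as text, so you get a plain Response, and open_in_browser() (which view() calls) only accepts HtmlResponse and TextResponse. The body is still ordinary JSON, so load it via json.loads() (or use a HAL client library) if you want to parse it: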

    $ scrapy shell https://catalogue.presto.com.au/
    In [1]: response.headers
    Out[1]: 
    {'Age': '0',
     'Cache-Control': 'max-age=300, public, s-maxage=300',
     'Content-Type': 'application/hal+json',  # HERE
     'Date': 'Sat, 14 Mar 2015 17:42:45 GMT',
     'Etag': '"834550fbc4b5fc5a188bd801c45876b7613b998b"',
     'Expires': 'Sat, 14 Mar 2015 17:47:45 GMT',
     'Last-Modified': 'Sat, 14 Mar 2015 17:42:45 GMT',
     'Server': 'Apache/2.2.3 (Red Hat)',
     'Vary': 'Accept,Accept-Encoding',
     'Via': '1.1 varnish',
     'X-Powered-By': 'PHP/5.4.15',
     'X-Varnish': '905097089'}
    In [2]: import json
    
    In [3]: json.loads(response.body)
    Out[3]: 
    {u'_links': {u'curies': [{u'href': u'/rels/{rel}',
        u'name': u'ooyala',
        u'templated': True}],
    ...
    {window?}&size={size?}&discovery_profile_id={discovery_profile_id?}&exclude_videos={exclude_videos?}&offer_type={offer_type}',
       u'templated': True,
       u'title': u'Trending series'},
      u'self': {u'href': u'/'}},
     u'version': u'1.6.0.1'}
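
Once the body is decoded, the HAL structure is just nested dicts, so following its links is plain dictionary access. Below is a minimal sketch of that step, not something taken from the Scrapy API: the helper names parse_hal and link_hrefs are made up for illustration, and the sample payload is a hand-written stand-in shaped like the output above, not the real response.

```python
import json


def parse_hal(body, encoding="utf-8"):
    """Decode a HAL+JSON payload (bytes or text) into a Python dict."""
    if isinstance(body, bytes):
        body = body.decode(encoding)
    return json.loads(body)


def link_hrefs(document):
    """Map each link relation under `_links` to a list of its href values."""
    hrefs = {}
    for rel, target in document.get("_links", {}).items():
        # In HAL a relation may hold one link object or a list of them.
        targets = target if isinstance(target, list) else [target]
        hrefs[rel] = [t["href"] for t in targets if "href" in t]
    return hrefs


# Hypothetical sample shaped like the response above (not the real payload).
sample = (
    b'{"_links": {"self": {"href": "/"}, '
    b'"curies": [{"href": "/rels/{rel}", "name": "ooyala", "templated": true}]}, '
    b'"version": "1.6.0.1"}'
)

doc = parse_hal(sample)
links = link_hrefs(doc)
print(links)
```

In a spider callback the equivalent call would be parse_hal(response.body). The hrefs are relative, so they would need joining against response.url before being requested, and templated links (those with "templated": true) need their {placeholders} filled in first.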