I'm trying to scrape a URL that Firefox and Safari report as HAL+JSON, and it returns a response object that Scrapy doesn't recognise.
The link is https://catalogue.presto.com.au/. It opens fine in Chrome, which shows the JSON in the browser, but Firefox and Safari download it as a file instead. I suspect that Scrapy likewise downloads the file when it opens the link, so it is not scraping it.
Has anyone encountered something similar, or does anyone have a solution?
Accessing via Shell
When I open the site from the terminal with "scrapy shell https://catalogue.presto.com.au", the request succeeds:
2015-03-15 00:15:08+0700 [default] DEBUG: Crawled (200) <GET https://catalogue.presto.com.au>
But when I then call view(response), I get this error:
>>> view(response)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/scrapy/utils/response.py", line 86, in open_in_browser
    response.__class__.__name__)
TypeError: Unsupported response type: Response
Running it from a spider:
def parse(self, response):
    print response.__class__
    open_in_browser(response)
2015-03-15 00:23:05+0700 [prestotv2] DEBUG: Crawled (200) <GET https://catalogue.presto.com.au/> (referer: None)
<class 'scrapy.http.response.Response'>  # this line is from "print response.__class__"
2015-03-15 00:23:05+0700 [prestotv2] ERROR: Spider error processing <GET https://catalogue.presto.com.au/>
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/nathansu/Documents/Development/Whutstream/scraping/Presto/presto/spiders/TvSpider.py", line 38, in parse
    open_in_browser(response)
  File "/Library/Python/2.7/site-packages/scrapy/utils/response.py", line 86, in open_in_browser
    response.__class__.__name__)
exceptions.TypeError: Unsupported response type: Response
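This happens because the response's Content-Type is application/hal+json. As your log shows, Scrapy does not map that type to one of its text response classes, so you get a plain Response rather than a TextResponse or HtmlResponse, and view() / open_in_browser() only support those two. The download itself succeeded and the body is ordinary JSON. Load it via json.loads() (or use a dedicated HAL library) if you want to parse it: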
$ scrapy shell https://catalogue.presto.com.au/
In [1]: response.headers
Out[1]:
{'Age': '0',
'Cache-Control': 'max-age=300, public, s-maxage=300',
'Content-Type': 'application/hal+json', # HERE
'Date': 'Sat, 14 Mar 2015 17:42:45 GMT',
'Etag': '"834550fbc4b5fc5a188bd801c45876b7613b998b"',
'Expires': 'Sat, 14 Mar 2015 17:47:45 GMT',
'Last-Modified': 'Sat, 14 Mar 2015 17:42:45 GMT',
'Server': 'Apache/2.2.3 (Red Hat)',
'Vary': 'Accept,Accept-Encoding',
'Via': '1.1 varnish',
'X-Powered-By': 'PHP/5.4.15',
'X-Varnish': '905097089'}
In [2]: import json
In [3]: json.loads(response.body)
Out[3]:
{u'_links': {u'curies': [{u'href': u'/rels/{rel}',
u'name': u'ooyala',
u'templated': True}],
...
{window?}&size={size?}&discovery_profile_id={discovery_profile_id?}&exclude_videos={exclude_videos?}&offer_type={offer_type}',
u'templated': True,
u'title': u'Trending series'},
u'self': {u'href': u'/'}},
u'version': u'1.6.0.1'}
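Inside a spider the same idea could look roughly like the sketch below. It is only an illustration: the helper names parse_hal_body and extract_link_hrefs are made up here, and the sample body is a trimmed copy of the output above rather than a live response.

```python
import json


def parse_hal_body(body, encoding="utf-8"):
    """Decode a raw HAL+JSON body (bytes or str) into a Python dict."""
    if isinstance(body, bytes):
        body = body.decode(encoding)
    return json.loads(body)


def extract_link_hrefs(document):
    """Return {rel: [href, ...]} for every relation under '_links'."""
    hrefs = {}
    for rel, target in document.get("_links", {}).items():
        # HAL lets a relation hold either one link object or a list of them.
        targets = target if isinstance(target, list) else [target]
        hrefs[rel] = [t["href"] for t in targets if "href" in t]
    return hrefs


# Hypothetical sample, trimmed from the shell output shown above.
sample_body = (
    b'{"_links": {"curies": [{"href": "/rels/{rel}", "name": "ooyala",'
    b' "templated": true}], "self": {"href": "/"}}, "version": "1.6.0.1"}'
)

document = parse_hal_body(sample_body)
links = extract_link_hrefs(document)
```

In your parse() callback you would pass response.body to parse_hal_body() instead of calling open_in_browser(response), then follow or yield whatever links you need from the resulting dict.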