I am using requests to fetch and parse data scraped by Scrapy through ScrapyRT (real-time scraping).
This is how I do it:
import requests

# Pass the spider name to ScrapyRT as query parameters
params = {
    'spider_name': spider,
    'start_requests': True,
}
# Scrape items
response = requests.get('http://scrapyrt:9080/crawl.json', params=params)
print ('RESPONSE JSON',response.json())
data = response.json()
As per the ScrapyRT documentation, with the 'start_requests' parameter set to True, the spider automatically requests its start URLs and passes each response to the parse method, which is the default callback for parsing responses.
start_requests
type: boolean
optional
Whether spider should execute Scrapy.Spider.start_requests method. start_requests are executed by default when you run Scrapy Spider normally without ScrapyRT, but this method is NOT executed in API by default. By default we assume that spider is expected to crawl ONLY url provided in parameters without making any requests to start_urls defined in Spider class. start_requests argument overrides this behavior. If this argument is present API will execute start_requests Spider method.
But the setup is not working: ScrapyRT answers with HTTP 500 and response.json() raises a JSONDecodeError. Log:
[2019-05-19 06:11:14,835: DEBUG/ForkPoolWorker-4] Starting new HTTP connection (1): scrapyrt:9080
[2019-05-19 06:11:15,414: DEBUG/ForkPoolWorker-4] http://scrapyrt:9080 "GET /crawl.json?spider_name=precious_tracks&start_requests=True HTTP/1.1" 500 7784
[2019-05-19 06:11:15,472: ERROR/ForkPoolWorker-4] Task project.api.routes.background.scrape_allmusic[87dbd825-dc1c-4789-8ee0-4151e5821798] raised unexpected: JSONDecodeError('Expecting value: line 1 column 1 (char 0)',)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/src/app/project/api/routes/background.py", line 908, in scrape_allmusic
print ('RESPONSE JSON',response.json())
File "/usr/lib/python3.6/site-packages/requests/models.py", line 897, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
The error was due to a bug in Twisted 19.2.0, a scrapyrt dependency, which assumed the response to be of the wrong type.
Once I installed Twisted==18.9.0, it worked.
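As a side note, the JSONDecodeError in the traceback hid the real problem: the server had already answered with a 500, and its body was not JSON. Below is a minimal sketch of a check that would surface the status first. The names decode_scrapyrt_response and ScrapyrtError are hypothetical helpers made up for this illustration, not part of requests or ScrapyRT, and the sketch assumes a requests-style response object.

```python
class ScrapyrtError(RuntimeError):
    """Hypothetical error type for a non-200 answer from ScrapyRT."""


def decode_scrapyrt_response(response, body_preview=500):
    """Return the JSON payload of a requests-style response, or raise with context.

    `response` is assumed to expose .status_code, .text and .json(),
    like the object returned by requests.get().
    """
    if response.status_code != 200:
        # In the log above, the 500 body could not be decoded as JSON, so
        # report the status and a slice of the body instead of calling .json().
        raise ScrapyrtError(
            "ScrapyRT returned HTTP {}: {}".format(
                response.status_code, response.text[:body_preview]
            )
        )
    return response.json()
```

With the code from the question, this could be called as data = decode_scrapyrt_response(response), so a failing crawl reports the HTTP status and the start of the error body rather than a JSON parsing error.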