My Scrapy spider is hosted on Scrapinghub and is managed via the run-spider API call. The only thing that changes from call to call is the list of start URLs, which may range from 100 URLs to a couple of thousand. What is the best way to update the start URLs in this scenario? From what I can see, there is no direct option for this in the SH API. I am thinking of writing the list of URLs to a MySQL table and, once it is updated, sending a plain run-job API call (the start URLs would then be generated from the MySQL table). Any comments on this solution, or other options?
My current setup is as follows.
import json
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # a spider needs a name; adjust to yours

    def __init__(self, startUrls, *args, **kwargs):
        self.keywords = ['sales', 'advertise', 'contact', 'about', 'policy',
                         'terms', 'feedback', 'support', 'faq']
        # startUrls arrives as a JSON-encoded string; decode it into a list
        self.startUrls = json.loads(startUrls)
        super(MySpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.startUrls:
            yield scrapy.Request(url=url)
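To make the MySQL idea from the question concrete, here is a minimal sketch of a spider that builds its start URLs from a table, assuming a hypothetical start_urls table with a url column and the pymysql driver; every connection detail below is a placeholder:

import pymysql
import scrapy

class MySqlSeededSpider(scrapy.Spider):
    name = 'mysql_seeded'  # hypothetical name

    def start_requests(self):
        # host, credentials, database, table and column are all assumptions
        connection = pymysql.connect(host='localhost', user='scraper',
                                     password='secret', db='scraping')
        try:
            with connection.cursor() as cursor:
                cursor.execute('SELECT url FROM start_urls')
                urls = [row[0] for row in cursor.fetchall()]
        finally:
            connection.close()
        for url in urls:
            yield scrapy.Request(url=url)

One thing to keep in mind with this approach is that the spider would need network access to the database from wherever it runs.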
You can pass parameters to a Scrapy spider and read them inside it: send the list of URLs encoded as JSON, decode it in __init__, and fire the requests from start_requests.
import json
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # assumed name

    def __init__(self, startUrls, *args, **kwargs):
        # the argument comes in as a JSON string; decode it into a list
        self.startUrls = json.loads(startUrls)
        super(MySpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.startUrls:
            # add a callback or other Request kwargs here as needed
            yield scrapy.Request(url=url)
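Before deploying, you can exercise the same argument handling locally, since Scrapy's -a flag passes keyword arguments into the spider's __init__: scrapy crawl myspider -a startUrls='["http://example.com"]' (the spider name is assumed from the snippet above).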
And here is how you send this parameter to your spider when you schedule a run:
curl -u APIKEY: https://app.scrapinghub.com/api/run.json -d project=PROJECT -d spider=SPIDER -d startUrls="JSON_ARRAY_OF_LINKS_HERE"
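If you are building the URL list in Python anyway (for example from the MySQL table above), you can make the same call with the requests library instead of curl; the endpoint and parameter names come from the curl line, everything else is a placeholder:

import json
import requests

APIKEY = 'YOUR_API_KEY'  # placeholder
start_urls = ['http://example.com/a', 'http://example.com/b']  # placeholder

response = requests.post(
    'https://app.scrapinghub.com/api/run.json',
    auth=(APIKEY, ''),  # same auth scheme as curl -u APIKEY:
    data={
        'project': 'PROJECT',  # your numeric project id
        'spider': 'SPIDER',    # your spider's name
        'startUrls': json.dumps(start_urls),  # JSON-encode, as the spider expects
    },
)
print(response.json())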
Your scrapinghub.yml file should look like this:
projects:
    default: 160868
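One deployment note: the project id in scrapinghub.yml is what the shub command-line tool reads, so the usual flow is to deploy the spider once with shub deploy and then trigger each run with the API call above; the startUrls argument is the only thing that changes between runs.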