My Scrapy spider is hosted on Scrapinghub and is managed via the run-spider API call. The only thing that changes from call to call is the list of start URLs, which may range from 100 URLs to a couple of thousand. What is the best way to update the start URLs in this scenario? From what I can see, there is no direct option for this in the SH API. I am thinking of writing the list of URLs to a MySQL table and, once it is updated, sending a plain run-job API call (the start URLs would then be generated from the MySQL table). Any comments on this solution, or other options?
My current setup is as follows.
import json
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # a spider needs a name; adjust to yours

    def __init__(self, startUrls, *args, **kwargs):
        self.keywords = ['sales', 'advertise', 'contact', 'about', 'policy',
                         'terms', 'feedback', 'support', 'faq']
        # startUrls arrives as a JSON-encoded string; decode it into a list
        self.startUrls = json.loads(startUrls)
        super(MySpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.startUrls:
            yield scrapy.Request(url=url)
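To make the MySQL idea from the question concrete, here is a minimal sketch of a spider that builds its start URLs from a table, assuming a hypothetical start_urls table with a url column and the pymysql driver; every connection detail below is a placeholder:

import pymysql
import scrapy

class MySqlSeededSpider(scrapy.Spider):
    name = 'mysql_seeded'  # hypothetical name

    def start_requests(self):
        # host, credentials, database, table and column are all assumptions
        connection = pymysql.connect(host='localhost', user='scraper',
                                     password='secret', db='scraping')
        try:
            with connection.cursor() as cursor:
                cursor.execute('SELECT url FROM start_urls')
                urls = [row[0] for row in cursor.fetchall()]
        finally:
            connection.close()
        for url in urls:
            yield scrapy.Request(url=url)

One thing to keep in mind with this approach is that the spider would need network access to the database from wherever it runs.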
You can pass parameters to a Scrapy spider and read them inside it: send the list of URLs encoded as JSON, decode it in __init__, and fire the requests from start_requests.
import json
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # assumed name

    def __init__(self, startUrls, *args, **kwargs):
        # the argument comes in as a JSON string; decode it into a list
        self.startUrls = json.loads(startUrls)
        super(MySpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.startUrls:
            # add a callback or other Request kwargs here as needed
            yield scrapy.Request(url=url)
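Before deploying, you can exercise the same argument handling locally, since Scrapy's -a flag passes keyword arguments into the spider's __init__: scrapy crawl myspider -a startUrls='["http://example.com"]' (the spider name is assumed from the snippet above).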
And here is how you send this parameter to your spider when you schedule a run:
curl -u APIKEY: https://app.scrapinghub.com/api/run.json -d project=PROJECT -d spider=SPIDER -d startUrls="JSON_ARRAY_OF_LINKS_HERE"
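If you are building the URL list in Python anyway (for example from the MySQL table above), you can make the same call with the requests library instead of curl; the endpoint and parameter names come from the curl line, everything else is a placeholder:

import json
import requests

APIKEY = 'YOUR_API_KEY'  # placeholder
start_urls = ['http://example.com/a', 'http://example.com/b']  # placeholder

response = requests.post(
    'https://app.scrapinghub.com/api/run.json',
    auth=(APIKEY, ''),  # same auth scheme as curl -u APIKEY:
    data={
        'project': 'PROJECT',  # your numeric project id
        'spider': 'SPIDER',    # your spider's name
        'startUrls': json.dumps(start_urls),  # JSON-encode, as the spider expects
    },
)
print(response.json())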
Your scrapinghub.yml file should look like this:
projects:
    default: 160868
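One deployment note: the project id in scrapinghub.yml is what the shub command-line tool reads, so the usual flow is to deploy the spider once with shub deploy and then trigger each run with the API call above; the startUrls argument is the only thing that changes between runs.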