I'm trying to run a Scrapy scraper on Cloud Run. The idea is that every 20 minutes a Cloud Scheduler cron job triggers the scraper and collects data from different sites. All sites have the same structure, so I'd like to reuse the same code and parallelize the scraping job, running something like `scrapy crawl scraper -a site=www.site1.com` and `scrapy crawl scraper -a site=www.site2.com`.
I have already deployed a version of the scraper, but it can only run `scrapy crawl scraper`. How can I make the site argument change at execution time?
Also, should I be using a Cloud Run job or a service?
According to that page of the documentation, there is a trick: the CLOUD_RUN_TASK_INDEX environment variable. That variable indicates the index of the task within the execution. For each index, pick a line from your file of websites (the line number equal to the env var value). That way, you can leverage Cloud Run jobs and their parallelism.
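To make this concrete, here is a minimal sketch of what the job's entrypoint could look like, assuming your spider is named `scraper` and the sites live in a `sites.txt` file bundled in the container image (both names are just examples):

```python
import os
import subprocess

# Cloud Run jobs set CLOUD_RUN_TASK_INDEX (0-based) for each parallel task.
task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))

# sites.txt (hypothetical): one website per line, shipped with the container image.
with open("sites.txt") as f:
    sites = [line.strip() for line in f if line.strip()]

# Each task picks the line matching its index and crawls only that site.
site = sites[task_index]
subprocess.run(["scrapy", "crawl", "scraper", "-a", f"site={site}"], check=True)
```

You would then set the job's task count to the number of sites so that each task handles exactly one of them.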
The main tradeoff here is that the list of websites to scrape is static.