amazon-ec2, scrapy, scrapyd

Scrapy on AWS EC2 : where to write the items?


I have a working spider on my local machine, which writes items to a local postgres database.

I am now trying to run the same spider through scrapyd on an EC2 instance. This obviously won't work, because the code (models, pipelines, and settings files) refers to a database on my local machine.

Which adaptations should I implement to make this work?


Solution

  • Found it, the answer was easier than I thought. In the settings.py file, delete the settings for ITEM_PIPELINES and DATABASE. After deletion, deploy the project through scrapyd on EC2.

    By default, items will now be written as JSON lines. This can be overridden per run with the FEED_FORMAT and FEED_URI settings:

    sudo curl http://xxxxxxxxx.us-west-2.compute.amazonaws.com:6800/schedule.json -d project=xxxxxxxxxx -d spider=xxxxxxxxx -d setting=FEED_URI=/var/lib/scrapyd/items/xxxxxxxxxx.csv -d setting=FEED_FORMAT=csv
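    The same request can be made from Python instead of curl. This is a minimal sketch that builds the form data for scrapyd's schedule.json endpoint; the project name, spider name, and host are hypothetical placeholders, and each per-run Scrapy setting override is passed as a repeated `setting=KEY=VALUE` field:

    ```python
    from urllib.parse import urlencode
    # from urllib.request import urlopen  # only needed to actually send the request

    SCRAPYD_URL = "http://xxxxxxxxx.us-west-2.compute.amazonaws.com:6800"  # placeholder host

    def build_schedule_payload(project, spider, settings=None):
        """Build the form fields for scrapyd's schedule.json endpoint.

        Each entry in `settings` becomes a repeated ("setting", "KEY=VALUE")
        field, mirroring the repeated -d setting=... flags in the curl command.
        """
        payload = [("project", project), ("spider", spider)]
        for key, value in (settings or {}).items():
            payload.append(("setting", f"{key}={value}"))
        return payload

    payload = build_schedule_payload(
        "myproject",  # hypothetical project name
        "myspider",   # hypothetical spider name
        {"FEED_URI": "/var/lib/scrapyd/items/myproject.csv", "FEED_FORMAT": "csv"},
    )
    body = urlencode(payload).encode()
    # urlopen(f"{SCRAPYD_URL}/schedule.json", data=body)  # uncomment to schedule the run
    ```

    Passing a list of (key, value) tuples to urlencode keeps the repeated `setting` fields, which a plain dict would collapse into one.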