I'm deploying a web scraping application composed of Scrapy spiders that scrape content from websites and take screenshots of webpages with the Splash JavaScript rendering service. I want to deploy the whole application to a single EC2 instance, but for the application to work I must run a Splash server from a Docker image at the same time as my spiders. How can I run multiple processes on an EC2 instance? Any advice on best practices would be appreciated.
Total noob question. I found that the best way to run a Splash server and Scrapy spiders on an EC2 instance after configuration is via a bash script scheduled with a cron job. Here is the bash script I came up with:
#!/bin/bash
# Change to the project directory so Scrapy picks up its settings.
cd /home/ec2-user/project_spider/project_spider
# Activate my virtual environment.
source /home/ec2-user/venv/python36/bin/activate
# Create a shell variable storing the date at runtime, used to name the log files.
LOGDATE=$(date +%Y%m%dT%H%M%S);
# Spin up a Splash instance from the Docker image.
sudo docker run -d -p 8050:8050 -p 5023:5023 scrapinghub/splash --max-timeout 3600
# Give Splash a moment to start before crawling: poll until its HTTP port responds.
until curl -s http://localhost:8050 > /dev/null; do sleep 1; done
# Scrape the first site and store a dated log file in the logs directory.
scrapy crawl anhui --logfile /home/ec2-user/project_spider/project_spider/logs/anhui_spider/anhui_spider_$LOGDATE.log
...
# Stop and remove the Splash container once the spiders are done.
sudo docker rm $(sudo docker stop $(sudo docker ps -a -q --filter ancestor=scrapinghub/splash --format="{{.ID}}"))
# Exit the virtual environment.
deactivate
# Send an email to confirm the cron job ran successfully.
# Note that sending email from EC2 is difficult: you cannot use 'MAILTO'
# in your crontab without setting up something like Postfix or Sendmail.
# Using Mailgun is an easy way around that.
curl -s --user 'api:<YOURAPIHERE>' \
https://api.mailgun.net/v3/<YOURDOMAINHERE>/messages \
-F from='<YOURDOMAINADDRESS>' \
-F to=<RECIPIENT> \
-F subject='Cronjob Run Successfully' \
-F text='Cronjob completed.'
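For reference, the script above can be scheduled with a crontab entry along these lines. The script path, name, and schedule here are assumptions; adjust them to your own setup:

```shell
# One-time setup: make the script executable (path is hypothetical).
#   chmod +x /home/ec2-user/run_spiders.sh
#
# Then edit the ec2-user crontab with `crontab -e` and add a line such as:
# run the scrape every day at 02:00, appending cron-level output/errors to a log
0 2 * * * /home/ec2-user/run_spiders.sh >> /home/ec2-user/cron.log 2>&1
```

Redirecting stdout and stderr to a file is useful because, without a working mail setup, any output cron tries to mail you is silently lost.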