amazon-web-services, amazon-ec2, scrapy, splash-screen

Running a Splash server and Scrapy spiders on the same EC2 instance


I'm deploying a web scraping application made up of Scrapy spiders that scrape content from websites and take screenshots of web pages using the Splash JavaScript rendering service. I want to deploy the whole application to a single EC2 instance, but for it to work, a Splash server must be running from a Docker image while my spiders run. How can I run multiple processes on one EC2 instance? Any advice on best practices would be much appreciated.


Solution

  • Total noob question on my part. Once the instance is configured, the best way I found to run a Splash server and Scrapy spiders on the same EC2 instance is a bash script scheduled with a cron job. Here is the bash script I came up with:

    #!/bin/bash
    # Change to proper directory to run Scrapy spiders.
    cd /home/ec2-user/project_spider/project_spider
    
    # Activate my virtual environment.
    source /home/ec2-user/venv/python36/bin/activate
    
    # Create a shell variable to store the date and time at runtime.
    LOGDATE=$(date +%Y%m%dT%H%M%S)
    
    # Spin up splash instance from docker image.
    sudo docker run -d -p 8050:8050 -p 5023:5023 scrapinghub/splash --max-timeout 3600
    
    # Scrape the first site and store a dated log file in the logs directory.
    scrapy crawl anhui --logfile "/home/ec2-user/project_spider/project_spider/logs/anhui_spider/anhui_spider_${LOGDATE}.log"
    
    ...
    
    # Stop and remove the Splash container(s) started from the scrapinghub/splash image.
    sudo docker rm $(sudo docker stop $(sudo docker ps -a -q --filter ancestor=scrapinghub/splash --format="{{.ID}}"))
    
    # Exit virtual environment.
    deactivate
    
    # Send an email to confirm the cron job ran to completion.
    #   Note that sending email from EC2 is difficult and you cannot use 'MAILTO'
    #   in your crontab without setting up something like postfix or sendmail.
    #   Using Mailgun is an easy way around that.
    
    curl -s --user 'api:<YOURAPIHERE>' \
        https://api.mailgun.net/v3/<YOURDOMAINHERE>/messages \
            -F from='<YOURDOMAINADDRESS>' \
            -F to=<RECIPIENT> \
            -F subject='Cronjob Run Successfully' \
            -F text='Cronjob completed.'
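
One thing to watch in a script like this: `docker run -d` returns as soon as the container has started, which may be before Splash is ready to accept requests, so the first spider could hit a server that is not up yet. Below is a minimal sketch of a retry helper that could go between the `docker run` line and the first `scrapy crawl`. The function name `wait_until` and the try/delay values are my own assumptions for illustration, not part of Docker, Splash, or Scrapy, and the commented example assumes Splash answers plain HTTP on the mapped port 8050.

```shell
#!/bin/bash
# wait_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# (hypothetical helper) Runs COMMAND repeatedly until it succeeds or
# MAX_TRIES attempts have been used. Returns 0 on success, 1 otherwise.
wait_until() {
    local max_tries=$1
    local delay=$2
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        # Discard the command's output; only its exit status matters here.
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    return 1
}

# Example call (assumption: Splash serves HTTP on localhost:8050, as mapped
# by the `docker run` line in the script above). Try up to 30 times, one
# second apart, and abort the script if Splash never responds:
#
#   wait_until 30 1 curl -sf -o /dev/null http://localhost:8050/ || exit 1
```

With a check like this in place the spiders only start once Splash responds, and the script can exit early instead of sending the confirmation email after a run that never reached the server.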