amazon-web-services, apache-spark, amazon-ec2, airflow, amazon-emr

How do I deploy my Airflow Scheduler to AWS EC2?


I'm trying to build a simple data pipeline on AWS using Airflow. I've created a DAG that scrapes data to S3 daily and then processes it with a Spark Job running on EMR.

I'm currently running the Airflow scheduler locally on my laptop, but of course I know this isn't a good long-term solution.

So I wanted some tips on deploying my scheduler to EC2 (instance size, deployment process, or anything else that would be useful to know).


Solution

  • Running locally is usually not feasible past the test phase, so you need an always-on server. The following are the options and a guide for deploying it to AWS EC2.

    You can deploy Airflow on EC2 instances with Docker/Airflow images. A t3.medium will usually suffice if you don't have too many DAGs. You can also create a workflow that runs every 7 days and cleans up the log files, so disk space won't be a problem as long as memory consumption stays constant. You can install and configure Airflow on EC2 normally, just as you do on your local computer, but I prefer setting it up with the Docker image by puckel here.
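
    For the log cleanup mentioned above, a minimal sketch of such a job run from cron (the log path is an assumption based on the puckel image defaults; run it inside the container, or point it at a host directory you mount for logs):

    # delete Airflow log files older than 7 days (hypothetical path, adjust to your setup)
    find /usr/local/airflow/logs -type f -mtime +7 -delete
    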

    First, you can either use an AMI that already has Docker installed or install it yourself.
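
    For example, on an Amazon Linux 2 instance (package names differ on other distributions), installing Docker looks roughly like this:

    sudo yum update -y
    sudo amazon-linux-extras install docker -y
    sudo service docker start
    sudo usermod -a -G docker ec2-user   # log out and back in for the group change to apply
    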

    Next, pull the image from Docker Hub:

    docker pull puckel/docker-airflow
    

    Here, you might get a problem with a SQLAlchemy version conflict (if not, ignore this step). So change this line in the Dockerfile to use a higher version of Airflow, like 1.10.10:

    ARG AIRFLOW_VERSION=1.10.9 # change this to 1.10.10 (or hardcode the SQLAlchemy version)
    
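    Since AIRFLOW_VERSION is already a build ARG, you can also override it at build time instead of editing the Dockerfile; a sketch, run from a clone of the puckel/docker-airflow repository:

    docker build --rm --build-arg AIRFLOW_VERSION=1.10.10 -t puckel/docker-airflow .
    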

    Next, you may need to add a user in Postgres.
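
    If you point Airflow at a separate Postgres container (for example with the LocalExecutor), a sketch of creating that user and database (the container name, user name and password here are placeholders):

    docker exec -ti <postgres_container> psql -U postgres -c "CREATE USER airflow WITH PASSWORD 'airflow';"
    docker exec -ti <postgres_container> psql -U postgres -c "CREATE DATABASE airflow OWNER airflow;"
    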

    Now you can run it as

    docker run -d -p 8080:8080 puckel/docker-airflow webserver
    

    In order to enter the command line (for starting the executor, scheduler, etc.), grab the container name/ID from

    docker ps
    

    and use this command

    docker exec -ti <name_of_container> bash
    
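    Inside that shell you can then start the components you need; for example, with the Airflow 1.10.x CLI:

    # start the scheduler in the foreground (run it under nohup or tmux if it should persist)
    airflow scheduler

    # quick sanity check that your DAGs are being picked up
    airflow list_dags
    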

    Also, in order to mount a folder on the EC2 instance to the Docker DAGs folder, you can mount it like below and your DAGs will stay in sync with the Airflow DAGs folder:

    docker run -d -p 8080:8080 -v /path/to/dags/on/your/ec2/folder/:/usr/local/airflow/dags  puckel/docker-airflow webserver
    

    In order to access this in a browser from any other computer (your own laptop):

    First, open port 8080 in the EC2 instance's security group for your IP.
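
    If you prefer the AWS CLI to the console, the rule looks roughly like this (the security group ID and IP are placeholders for your own values):

    aws ec2 authorize-security-group-ingress --group-id <your-security-group-id> --protocol tcp --port 8080 --cidr <your-public-ip>/32
    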

    Then, from a browser, you will be able to access it at

    <ec2-public-ip>:8080
    

    Other third-party managed options to run Airflow on AWS

    Astronomer is a company that provides fully hosted Airflow on all cloud platforms with advanced features like monitoring. They have some of the top Airflow contributors on their team.

    Cost:

    The monthly cost of running Airflow on a t3.medium for a whole month will be around 32.37 USD and can be calculated here.
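
    As a rough breakdown (assuming on-demand pricing in us-east-1 at the time of writing; use the calculator for your own region and storage size):

    # t3.medium on-demand: 0.0416 USD/hour x 730 hours ≈ 30.37 USD
    # 20 GB gp2 EBS root volume: 20 x 0.10 USD/GB      ≈  2.00 USD
    # total                                            ≈ 32.37 USD
    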

    Astronomer costs around $100/month per 10 AU (1 CPU, 3.75 GB memory), but there are trade-offs: it's managed by Astronomer and they provide support, etc.