I'm trying to build a simple data pipeline on AWS using Airflow. I've created a DAG that scrapes data to S3 daily and then processes it with a Spark Job running on EMR.
I'm currently running the Airflow scheduler locally on my laptop, but of course I know this isn't a good long-term solution.
So I wanted some tips on deploying my scheduler to EC2 (instance size, deployment process, or anything else that would be useful to know).
Running locally is usually not feasible past the testing phase, so you need a proper server. Below are some options and a guide for deploying Airflow to AWS EC2.
You can deploy Airflow on an EC2 instance using Docker and an Airflow image. A t3.medium will usually suffice if you don't have too many DAGs. You can create a maintenance workflow that runs every 7 days and cleans up the log files, so disk space won't be a problem as long as memory consumption stays constant. You can install and configure Airflow on EC2 the same way you do on your local computer, but I prefer setting it up with the Docker image by puckel here.
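As a sketch of what such a weekly log-cleanup task could do (the log directory path and 7-day retention are assumptions for illustration, not Airflow defaults):

```python
import time
from pathlib import Path

def clean_old_logs(log_dir: str, max_age_days: int = 7) -> int:
    """Delete *.log files older than max_age_days; return how many were removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for path in Path(log_dir).rglob("*.log"):
        # compare file modification time against the cutoff timestamp
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```

You could call a function like this from a `PythonOperator` in a DAG scheduled with `@weekly`.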
First, either use an AMI that has Docker installed or install it yourself.
Next, pull the image from Docker Hub:

```
docker pull puckel/docker-airflow
```
Here, you might run into a SQLAlchemy version conflict (if not, skip this step). To use a higher Airflow version such as 1.10.10, change this line in the Dockerfile:

```
ARG AIRFLOW_VERSION=1.10.9  # change this to 1.10.10 (or pin the SQLAlchemy version)
```
Next, you may need to add a user in Postgres.
Now you can run it with:

```
docker run -d -p 8080:8080 puckel/docker-airflow webserver
```
To get a shell inside the container (for starting the scheduler, executor, etc.), grab the container name/ID from `docker ps` and use this command:

```
docker exec -ti <name_of_container> bash
```
Also, to keep your DAGs in sync with Airflow, you can mount a folder on the EC2 instance onto the container's dags folder like below:

```
docker run -d -p 8080:8080 -v /path/to/dags/on/your/ec2/folder/:/usr/local/airflow/dags puckel/docker-airflow webserver
```
To access the web UI in a browser from any other computer (e.g. your own laptop), first open HTTP port 8080 in the EC2 security group for your IP. Then, from the browser, you will be able to access it at:

```
<ec2-public-ip>:8080
```
Another third-party managed option to run Airflow on AWS:

Astronomer is a company that provides fully hosted Airflow on all major cloud platforms, with advanced features such as monitoring. They have some of the top Airflow contributors on their team.
Cost:

The monthly cost of running Airflow on a t3.medium for a whole month will be around 32.37 USD, and can be calculated here.

Astronomer costs around $100/month per 10 AU (1 CPU, 3.75 GB memory), but there are trade-offs: it's managed by Astronomer and they provide support, etc.
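As a rough sanity check on the EC2 figure (the hourly rate below is an assumed on-demand us-east-1 price for t3.medium; actual pricing varies by region, and EBS storage adds a little more):

```python
hourly_rate = 0.0416          # assumed t3.medium on-demand price, USD/hour
hours_per_month = 24 * 30     # approximate a month as 30 days
monthly_cost = hourly_rate * hours_per_month
print(f"~${monthly_cost:.2f}/month")  # roughly $30/month before storage
```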