I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch, but have not dealt with the scaling factors of 10K parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. The other option would be to have one task that kicks off the 10K containers and monitors them from there.
I have no experience with Step Functions, but have heard it's AWS's Airflow. There look to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?
I have worked with both Apache Airflow and AWS Step Functions, and here are some insights:
- Step Functions needs no maintenance out of the box. It has the high availability and scalability that your use case requires; with Airflow, we would have to achieve that ourselves with auto-scaling/load balancing on servers or containers (Kubernetes).*
- Both Airflow and Step Functions have user-friendly UIs. While Airflow supports multiple representations of the state machine, Step Functions displays the state machine only as a DAG.
- As of version 2.0, Airflow's REST API is stable. AWS Step Functions is also supported by a range of production-grade CLIs and SDKs.
- Airflow has server costs, while Step Functions has 4,000 free step executions per month (free tier) and costs $0.000025 per step after that. For example, if you use 10K steps for AWS Batch that run once daily, you will be charged $0.25 per day ($7.50 per month). The price for an Airflow server (t2.large EC2, 1-year reserved instance) is $41.98 per month. We will have to use AWS Batch in either case.**
- AWS Batch can integrate with both Airflow and Step Functions.
- You can clear and rerun a failed task in Apache Airflow, but in Step Functions you will have to create a custom implementation to handle that. You can also handle automated retries with back-off in the Step Functions definition.
- For a failed task in Step Functions, you get a visual representation of the failed state and a detailed message when you click it. You can also use the AWS CLI or SDK to get the details.
- Step Functions uses easy-to-use JSON for the state machine definition, while Airflow uses Python scripts.
- Step Functions supports async callbacks, i.e. the state machine pauses until an external source notifies it to resume. At the time of writing, Airflow had yet to add this feature. (Update: it has since been added: Callbacks)
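
To make the point about automated retries with back-off more concrete, here is a minimal sketch of a state machine definition built as a Python dict and serialized to JSON. The state name `ProcessChunk`, the job name, and the queue and job definition values are placeholders I made up for illustration; the retry numbers are arbitrary. The `Retry` fields and the `batch:submitJob.sync` resource are from the Amazon States Language and the Step Functions Batch integration, but treat the overall shape as a starting point, not a tested production definition.

```python
import json


def build_batch_task_definition(job_queue, job_definition,
                                interval_seconds=30, max_attempts=3,
                                backoff_rate=2.0):
    """Return a one-state definition that submits a Batch job and retries on failure.

    All names here (ProcessChunk, process-chunk) are hypothetical placeholders.
    """
    return {
        "StartAt": "ProcessChunk",
        "States": {
            "ProcessChunk": {
                "Type": "Task",
                # ".sync" makes the state wait until the Batch job finishes.
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "process-chunk",
                    "JobQueue": job_queue,
                    "JobDefinition": job_definition,
                },
                # Automated retries with exponential back-off.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": interval_seconds,
                        "MaxAttempts": max_attempts,
                        "BackoffRate": backoff_rate,
                    }
                ],
                "End": True,
            }
        },
    }


def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry: the interval grows by backoff_rate each time."""
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]


if __name__ == "__main__":
    definition = build_batch_task_definition("my-queue", "my-job-definition")
    print(json.dumps(definition, indent=2))
    print(retry_delays(30, 3, 2.0))
```

With these assumed settings, a failing job would be retried three times, waiting 30, 60, and 120 seconds. Rerunning a failed step by hand after debugging, as you would in Airflow, is still something you would need to build yourself.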
Overall, I see more advantages of using AWS Step Functions. You will have to consider maintenance cost and development cost for both services as per your use case.
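
As a rough way to compare the running costs mentioned above, here is a small sketch of the arithmetic. The prices are the ones quoted in this answer and may have changed; the function and constant names are my own, and a 30-day month is an assumption. It ignores AWS Batch compute costs, which apply in both cases.

```python
# Prices as quoted in this answer (assumptions; check current AWS pricing).
PRICE_PER_STEP = 0.000025          # USD per step after the free tier
FREE_STEPS_PER_MONTH = 4000        # free tier allowance
AIRFLOW_SERVER_MONTHLY = 41.98     # USD, t2.large EC2, 1-year reserved


def step_functions_monthly_cost(steps_per_run, runs_per_month,
                                apply_free_tier=False):
    """Estimated monthly Step Functions charge in USD."""
    total_steps = steps_per_run * runs_per_month
    if apply_free_tier:
        total_steps = max(0, total_steps - FREE_STEPS_PER_MONTH)
    return total_steps * PRICE_PER_STEP


if __name__ == "__main__":
    # 10K steps, once a day, assuming a 30-day month.
    monthly = step_functions_monthly_cost(10_000, 30)
    print(f"Step Functions: ${monthly:.2f}/month")
    print(f"Airflow server: ${AIRFLOW_SERVER_MONTHLY:.2f}/month")
```

This only covers the per-step charge against the server price; maintenance and development effort are not captured by the numbers.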
UPDATES (AWS Managed Workflows for Apache Airflow Service):
- *With the AWS Managed Workflows for Apache Airflow (MWAA) service, you can offload deployment, maintenance, autoscaling/load balancing, and security of your Airflow service to AWS. But please consider the version number you're willing to settle for, as AWS managed services usually lag behind the latest version. (e.g. as of March 08, 2021, the latest version of open source Airflow is 2.0.1, while MWAA allows version 1.10.12)
- **MWAA is billed by environment, instance, and storage. More details here.