Tags: docker, airflow, local, google-cloud-composer, data-pipeline

Airflow on Google Cloud Composer vs Docker


I can't find much information on the differences between running Airflow on Google Cloud Composer versus in Docker. I am trying to move our data pipelines, which currently run on Google Cloud Composer, over to Docker so they can run locally, but I'm trying to conceptualize what the difference is.


Solution

  • Cloud Composer is a GCP managed service for Airflow. Composer runs in something known as a Composer environment, which runs on a Google Kubernetes Engine cluster. It also makes use of various other GCP services, such as:

    • Cloud SQL - stores the metadata associated with Airflow,
    • App Engine Flex - the Airflow web server runs as an App Engine Flex application, which is protected by an Identity-Aware Proxy,
    • GCS bucket - in order to submit a pipeline to be scheduled and run on Composer, all we need to do is copy our Python code into a GCS bucket. Within that bucket there is a folder called dags; any Python code uploaded into that folder is automatically picked up and processed by Composer (a minimal example follows this list).
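
    For reference, here is a minimal DAG sketch (not from the original post) showing the kind of file you would drop into the bucket's dags folder; the bucket path, dag_id, schedule and task are placeholders:

        # my_first_dag.py - copy into gs://<your-composer-bucket>/dags/
        # Composer picks up any Python file in that folder automatically.
        from datetime import datetime

        from airflow import DAG
        from airflow.operators.bash import BashOperator

        with DAG(
            dag_id="example_composer_dag",   # placeholder name
            start_date=datetime(2023, 1, 1),
            schedule_interval="@daily",
            catchup=False,
        ) as dag:
            hello = BashOperator(
                task_id="say_hello",
                bash_command="echo 'hello from Composer'",
            )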

    How does Cloud Composer benefit you?

    • Focus on your workflows, and let Composer manage the infrastructure (creating the workers, setting up the web server, the message brokers),

    • One-click creation of a new Airflow environment,

    • Easy and controlled access to the Airflow Web UI,

    • Provides logging and monitoring metrics, and alerts you when your workflow is not running,

    • Integrates with the Google Cloud services: Big Data, Machine Learning and so on (see the sketch after this list). You can also run jobs elsewhere, i.e. on another cloud provider (such as Amazon).
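
    As an illustration of that integration (a hedged sketch, not from the original answer; the dag_id and query are placeholders), a DAG can call GCP services directly through the Google provider operators:

        # Requires the apache-airflow-providers-google package.
        from datetime import datetime

        from airflow import DAG
        from airflow.providers.google.cloud.operators.bigquery import (
            BigQueryInsertJobOperator,
        )

        with DAG(
            dag_id="example_bigquery_integration",  # placeholder name
            start_date=datetime(2023, 1, 1),
            schedule_interval=None,
            catchup=False,
        ) as dag:
            run_query = BigQueryInsertJobOperator(
                task_id="run_query",
                configuration={
                    "query": {"query": "SELECT 1", "useLegacySql": False},
                },
            )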

    Of course you have to pay for the hosted service, but the cost is low compared to hosting a production Airflow server on your own.

    Airflow on-premises (e.g. with Docker)

    • DevOps work that needs to be done: create a new server, manage the Airflow installation, take care of dependency and package management, check server health, and handle scaling and security,
    • pull an Airflow image from a registry and create the container,
    • create a volume that maps the directory on the local machine where your DAGs are held to the location where Airflow reads them in the container,
    • whenever you want to submit a DAG that needs to access a GCP service, you need to take care of setting up credentials yourself. The application's service account should be created and its key downloaded as a JSON file containing the credentials. This JSON file must be mounted into your Docker container, and the GOOGLE_APPLICATION_CREDENTIALS environment variable must contain the path to the JSON file inside the container (see the sketch after this list).
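
    As a quick sanity check that the credentials are wired up correctly, you can run something like the following inside the container (a hedged sketch; the mount path is an example, not from the original answer):

        # Assumes the service-account JSON was mounted into the container and
        # GOOGLE_APPLICATION_CREDENTIALS points at it, e.g.
        #   GOOGLE_APPLICATION_CREDENTIALS=/opt/airflow/secrets/key.json
        # Requires the google-cloud-storage package.
        import os

        from google.cloud import storage

        creds_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
        print(f"Using credentials file: {creds_path}")

        # The client reads the JSON key automatically from that env variable.
        client = storage.Client()
        for bucket in client.list_buckets():
            print(bucket.name)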

    To sum up, if you don't want to deal with all of those DevOps problems and instead just want to focus on your workflows, then Google Cloud Composer is a great solution for you.

    Additionally, I would like to share with you tutorials on setting up Airflow with Docker and on GCP Cloud Composer.