kubernetes, google-cloud-platform, airflow, google-cloud-composer

How to access Google Cloud Composer's data folder from a pod launched using KubernetesPodOperator?


I have a Google Cloud Composer 1 environment (Airflow 2.1.2) where I want to run an Airflow DAG that utilizes the KubernetesPodOperator.

Cloud Composer makes a shared file directory available to all DAGs for storing application data. The files in this directory reside in a Google Cloud Storage bucket managed by Composer. Composer uses FUSE to map the directory to the path /home/airflow/gcs/data on all of its Airflow worker pods.
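For example, a task that runs directly on a Composer worker (say, a PythonOperator; the task id and filename below are only illustrative) can read and write under that path with ordinary file I/O:

    from airflow.operators.python import PythonOperator

    def _write_sample(**_):
        # On a Composer worker, /home/airflow/gcs/data is the FUSE-mounted
        # data/ folder of the environment's bucket, so plain file I/O works.
        with open('/home/airflow/gcs/data/sample.txt', 'w') as f:
            f.write('written from a worker task')

    write_sample = PythonOperator(
        task_id='write_sample',
        python_callable=_write_sample,
    )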

In my DAG I run several Kubernetes pods like so:

    from airflow.contrib.operators import kubernetes_pod_operator
    
    # ...

    splitter = kubernetes_pod_operator.KubernetesPodOperator(
        task_id='splitter',
        name='splitter',
        namespace='default',
        image='europe-west1-docker.pkg.dev/redacted/splitter:2.3',
        cmds=["dotnet", "splitter.dll"],
    )

The application code in all the pods that I run needs to read from and write to the /home/airflow/gcs/data directory. But when I run the DAG, my application code is unable to access the directory. This is likely because Composer maps the directory into its own worker pods but does not extend this courtesy to the pods I launch.

What do I need to do to give my pods r/w access to the /home/airflow/gcs/data directory?


Solution

  • Cloud Composer uses FUSE to mount certain Cloud Storage directories into the Airflow worker pods that run in Kubernetes. These directories are mounted with default permissions that cannot be overridden, because Google Cloud Storage does not track that metadata. One possible workaround is a BashOperator that runs at the beginning of your DAG and copies the files into a new directory (sketched below). Another is to use a path that is not backed by Google Cloud Storage, such as a /pod path.
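
A minimal sketch of the first suggestion, assuming a target directory such as /home/airflow/workdir that you create yourself (both the path and the task id are placeholders):

    from airflow.operators.bash import BashOperator

    # Copy the Composer-managed data folder into a writable working
    # directory at the start of the DAG; downstream tasks then read
    # from the copy instead of the FUSE mount.
    copy_data = BashOperator(
        task_id='copy_data',
        bash_command=(
            'mkdir -p /home/airflow/workdir && '
            'cp -r /home/airflow/gcs/data/. /home/airflow/workdir/'
        ),
    )

    copy_data >> splitter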