Tags: python, docker, airflow, google-cloud-composer, docker-py

write a file from a docker container in google-cloud-composer


Some context: I'm using composer-1.3.0-airflow-1.10.0

Installed the PyPI package docker===2.7.0

For a while I tried to use the DockerOperator, but I need to pull images from a private gcr.io registry located in another GCP project, and that is a mess.

I won't go into the details of why I gave up on that. I switched to a simple PythonOperator that pulls and runs the Docker image. Here is how the operator goes:

import os

def runImage(**kwargs):
    workingDir = "/app"
    imageName = "eu.gcr.io/private-registry/image"
    # Mount the worker's gcsfuse data directory as /out/ inside the container
    volume = {"/home/airflow/gcs/data/": {"bind": "/out/", "mode": "rw"}}
    userUid = os.getuid()
    command = getContainerCommand()   # custom helper: builds the container command
    client = getClient()              # custom helper: returns a docker client
    print("pulling image")
    image = pullDockerImage(client, imageName)  # custom helper: pulls from the private registry
    print("image pulled: %s" % image.id)
    output = client.containers.run(
        image=imageName,
        command=command,
        volumes=volume,
        privileged=True,
        working_dir=workingDir,
        remove=True,
        read_only=False,
        user=userUid)
    print(output)
    return True


task = PythonOperator(
    task_id="test_pull_docker_image",
    python_callable=runImage,
    dag=dag
)
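
For context, the helpers getClient, pullDockerImage and getContainerCommand are not shown here; a minimal sketch of what such helpers could look like with the docker SDK, assuming gcr.io authentication with an access token from the environment's default service account (this is an assumption, not the original code), would be:

import docker
import google.auth
import google.auth.transport.requests

def getClient():
    # Docker client talking to the host's Docker daemon
    # (e.g. through a mounted /var/run/docker.sock).
    return docker.from_env()

def pullDockerImage(client, imageName):
    # Authenticate against the private gcr.io registry with the default
    # service-account credentials, then pull the image.
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    credentials.refresh(google.auth.transport.requests.Request())
    client.login(username="oauth2accesstoken",
                 password=credentials.token,
                 registry="https://eu.gcr.io")
    return client.images.pull(imageName)

def getContainerCommand():
    # Whatever arguments the container entrypoint expects (placeholder).
    return ["some", "args"]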

The image is pulled correctly, and it runs (which was already a victory).

The container writes some files to /out/, which I mounted as a volume onto /home/airflow/gcs/data with rw rights.

The working_dir, user, privileged, read_only options were added for testing, but I don't think they're relevant.

The files are not created. Writing a file directly in Python to /home/airflow/gcs/data works just fine.
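
For example, a direct write like the following (a trivial check, file name assumed) does show up in the bucket:

with open("/home/airflow/gcs/data/test.txt", "w") as f:
    f.write("written from the airflow worker, through gcsfuse")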

The container itself is compiled C#. Locally, if the container fails to write the files I get an error (like Unhandled Exception: System.UnauthorizedAccessException: Access to the path '/out/file.txt' is denied. ---> System.IO.IOException: Permission denied).

But when I run the DAG inside Airflow/Composer, everything looks just fine: the container output is as expected and no error is raised.

Maybe the Dockerfile could be useful:

FROM microsoft/dotnet:2.1-sdk AS build-env
WORKDIR /app

# Copy csproj and restore as distinct layers
COPY *.csproj ./
RUN dotnet restore

# Copy everything else and build
COPY . ./
RUN dotnet publish -c Release -o out

# Build runtime image
FROM microsoft/dotnet:2.1-sdk
WORKDIR /app
COPY --from=build-env /app/out .
ENTRYPOINT ["dotnet", "programm.dll"]

So the question is:

Why does it not write the files? And how do I allow the container to write files to /home/airflow/gcs/data?


Solution

  • So I resolved this issue thanks to my other question

    The answer here is in two parts:

    /home/airflow/gcs is a gcsfuse volume. Using this directory as the Docker volume just doesn't work (it may work by adding a plugin; I lost the link for this :/ )

    We want to add a volume inside the airflow workers; we can do so by updating the Kubernetes config of the worker pods (see this question for how to update the config). We want to add a hostPath:

    containers:
      ...
      securityContext:
        privileged: true
        runAsUser: 0
        capabilities:
          add: 
          - SYS_ADMIN
      ...
      volumeMounts:
      - mountPath: /etc/airflow/airflow_cfg
        name: airflow-config
      - mountPath: /home/airflow/gcs
        name: gcsdir
      - mountPath: /var/run/docker.sock
        name: docker-host
      - mountPath: /bin/docker
        name: docker-app
      - mountPath: /path/you/want/as/volume
        name: mountname
      ...
      volumes:
      - configMap:
          defaultMode: 420
          name: airflow-configmap
        name: airflow-config
      - emptyDir: {}
        name: gcsdir
      - hostPath:
          path: /path/you/want/as/volume
          type: DirectoryOrCreate
        name: mountname
      - hostPath:
          path: /var/run/docker.sock
          type: ""
        name: docker-host
      - hostPath:
          path: /usr/bin/docker
          type: ""
        name: docker-app
    

    And now in the DAG definition we can use volume = {"/path/you/want/as/volume": {"bind": "/out/", "mode": "rw"}}

    The files will exist inside the pod, and you can use another task to upload them to a GCS bucket or similar.
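
    As a rough sketch (the bucket name, object prefix and task wiring are placeholders, not from the original setup), such an upload task could look like:

    import os
    from google.cloud import storage
    from airflow.operators.python_operator import PythonOperator

    def uploadOutputFiles(**kwargs):
        # Copy everything the container wrote into the hostPath volume
        # up to a GCS bucket (bucket name is an assumption).
        localDir = "/path/you/want/as/volume"
        bucket = storage.Client().bucket("my-output-bucket")
        for fileName in os.listdir(localDir):
            blob = bucket.blob("container-output/" + fileName)
            blob.upload_from_filename(os.path.join(localDir, fileName))

    upload_task = PythonOperator(
        task_id="upload_container_output",
        python_callable=uploadOutputFiles,
        dag=dag,
    )

    task >> upload_task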

    Hope it can help somewhat :)