python · kubernetes · airflow · google-cloud-composer

KubernetesPodOperator: how to use cmds, or cmds and arguments, to run multiple commands


I'm using GCP Composer to run an algorithm, and at the end of the stream I want to run a task that performs several operations: copying and deleting files and folders from a volume to a bucket. I'm trying to perform these copy and delete operations via a KubernetesPodOperator. I'm having trouble finding the right way to run several commands using "cmds"; I also tried combining "cmds" with "arguments". Here is my KubernetesPodOperator and the cmds/arguments combinations I tried:

# imports assumed (Cloud Composer / Airflow 1.10-style contrib modules;
# adjust the paths for your Airflow version):
from airflow.contrib.operators import kubernetes_pod_operator
from airflow.contrib.kubernetes.volume import Volume
from airflow.contrib.kubernetes.volume_mount import VolumeMount

post_algo_run = kubernetes_pod_operator.KubernetesPodOperator(
    task_id="multi-coher-post-operations",
    name="multi-coher-post-operations",
    namespace="default",
    image="google/cloud-sdk:alpine",
    
    ### doesn't work ###
    cmds=["gsutil", "cp", "/data/splitter-output\*.csv",  "gs://my_bucket/data" , "&" , "gsutil", "rm", "-r", "/input"], 
    #Error:
        #[2022-01-27 09:31:38,407] {pod_manager.py:197} INFO - CommandException: Destination URL must name a directory, bucket, or bucket
        #[2022-01-27 09:31:38,408] {pod_manager.py:197} INFO - subdirectory for the multiple source form of the cp command.
    ####################

    ### doesn't work ###
    # cmds=["gsutil", "cp", "/data/splitter-output\*.csv",  "gs://my_bucket/data ;","gsutil", "rm", "-r", "/input"],
        # [2022-01-27 09:34:06,865] {pod_manager.py:197} INFO - CommandException: Destination URL must name a directory, bucket, or bucket
        # [2022-01-27 09:34:06,866] {pod_manager.py:197} INFO - subdirectory for the multiple source form of the cp command.
    ####################

    ### only performs the first command - only copying ###
    # cmds=["bash", "-cx"],
    # arguments=["gsutil cp /data/splitter-output\*.csv gs://my_bucket/data","gsutil rm -r /input"],                                    
        # [2022-01-27 09:36:09,164] {pod_manager.py:197} INFO - + gsutil cp '/data/splitter-output*.csv' gs://my_bucket/data
        # [2022-01-27 09:36:11,200] {pod_manager.py:197} INFO - Copying file:///data/splitter-output\Coherence Results-26-Jan-2022-1025Part1.csv [Content-Type=text/csv]...
        # [2022-01-27 09:36:11,300] {pod_manager.py:197} INFO - / [0 files][    0.0 B/ 93.0 KiB]                                                
        # / [1 files][ 93.0 KiB/ 93.0 KiB]
        # [2022-01-27 09:36:11,302] {pod_manager.py:197} INFO - Operation completed over 1 objects/93.0 KiB.
        # [2022-01-27 09:36:12,317] {kubernetes_pod.py:459} INFO - Deleting pod: multi-coher-post-operations.d66b4c91c9024bd289171c4d3ce35fdd
    ####################


    volumes=[
        Volume(
            name="nfs-pvc",
            configs={
                "persistentVolumeClaim": {"claimName": "nfs-pvc"}
            },
        )
    ],
    volume_mounts=[
        VolumeMount(
            name="nfs-pvc",
            mount_path="/data/",
            sub_path=None,
            read_only=False,
        )
    ],
)
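For context, the failure mode of the first two attempts can be reproduced locally: Kubernetes passes cmds/arguments to the container as a plain argv array with no shell in between, so shell operators like & and ; arrive as literal strings. A minimal sketch of this (using Python's subprocess to mimic a shell-less exec; this repro is not part of the original question):

```python
import subprocess

# With an argv list and no shell, "&&" is just another literal argument;
# the program receives it instead of the shell interpreting it. This is
# why gsutil saw "&", "gsutil", "rm", ... as extra source URLs above.
out = subprocess.run(
    ["echo", "first", "&&", "echo", "second"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)  # first && echo second
```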

Solution

  • I found a technique for running multiple commands. First, I worked out how the KubernetesPodOperator's cmds and arguments properties relate to Docker's ENTRYPOINT and CMD.

    The KubernetesPodOperator's cmds overrides the image's original ENTRYPOINT, and its arguments is equivalent to Docker's CMD.

    So, to run multiple commands from the KubernetesPodOperator, I used the following syntax. I set the KubernetesPodOperator's cmds to run bash with -c:

    cmds=["/bin/bash", "-c"],
    

    And I set the KubernetesPodOperator's arguments to run two echo commands separated by &&:

    arguments=["echo hello && echo goodbye"],
    

    So my KubernetesPodOperator looks like this:

    # import assumed (Airflow 1.10-style contrib path; adjust for your Airflow version):
    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    stajoverflow_test = KubernetesPodOperator(
        task_id="stajoverflow_test",
        name="stajoverflow_test",
        namespace="default",
        image="google/cloud-sdk:alpine",
        cmds=["/bin/bash", "-c"],
        arguments=["echo hello && echo goodbye"],
    )
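Why this works can be sketched locally as well: cmds plus arguments become the container's argv ["/bin/bash", "-c", "<script>"], and bash -c runs the whole string as one shell script, so && chains the commands (a subprocess-based sketch, not the operator itself):

```python
import subprocess

# bash -c receives the entire command string as a single argument and
# interprets it as a shell script, so && works as a command separator.
out = subprocess.run(
    ["/bin/bash", "-c", "echo hello && echo goodbye"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)  # hello
                   # goodbye
```

Note that with bash -c only the first element of arguments is treated as the script; any further list elements become positional parameters ($0, $1, ...), which is why the third attempt in the question ran only the copy command. For the original use case, the single-string argument would be something like "gsutil cp ... && gsutil rm -r /input".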