python, kubeflow, kubeflow-pipelines

Kubeflow volume loses data between steps


Hello, I am trying to share files between pipeline steps. To do this, I have the following code:

from kubernetes import client as k8s_client

VOLUME_NAME_PATH = 'pictures'
VOLUME_PATH = f'/{VOLUME_NAME_PATH}'
V1_VOLUME = k8s_client.V1Volume(name=VOLUME_NAME_PATH)
V1_VOLUME_MOUNT = k8s_client.V1VolumeMount(
                    mount_path=VOLUME_PATH,
                    name=VOLUME_NAME_PATH
                )

def pictures_pipeline():
    download_images_op_step = download_images_op(volume_path=VOLUME_PATH) \
        .add_volume(V1_VOLUME) \
        .add_volume_mount(V1_VOLUME_MOUNT)
    compress_images_op_step = compress_images_op(volume_path=VOLUME_PATH) \
        .add_volume(V1_VOLUME) \
        .add_volume_mount(V1_VOLUME_MOUNT)

    compress_images_op_step.after(download_images_op_step)

As you can see, I am creating V1_VOLUME and mounting the same volume in every step of the pipeline.

The first step, download_images_op_step, downloads the pictures and saves them in the volume, but when the second step starts, the volume is empty.

So how can I persist the data from one step to the next?

Thanks


Solution

  • Please check my answer to a similar question about volumes: https://stackoverflow.com/a/67898164/1497385

    The short answer is that using volumes is not a supported way of passing data between components in KFP. I'm not saying it cannot work, but a developer who goes outside the officially supported data passing method is on their own.

    Using KFP without KFP's data passing is pretty close to not using KFP at all...

    Here is how to pass data properly:

    from kfp.components import InputPath, OutputPath, create_component_from_func
    
    def download_images(
        url: str,
        output_path: OutputPath(),
    ):
        ...
        # Create directory at output_path
        # Put all images into it
    
    download_images_op = create_component_from_func(download_images)
    
    def compress_images(
        input_path: InputPath(),
        output_path: OutputPath(),
    ):
        # Read images from input_path
        # Write results to output_path
        ...
    
    compress_images_op = create_component_from_func(compress_images)
    
    def my_pipeline():
        images = download_images_op(
            url=...,
        ).outputs["output"]
    
        compressed_images = compress_images_op(
            input=images,
        ).outputs["output"]
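
    The function bodies above are left elided. As a rough illustration only, here is a minimal sketch of what they might contain, written as plain Python so it can be run locally without KFP. Everything specific in it is an assumption made for the example: the placeholder file names, the use of gzip for "compression", the run_locally helper, and the plain str annotations (in a real component the path parameters would be annotated with InputPath() and OutputPath() as shown above, and the download step would actually fetch from the URL).

```python
import gzip
import os
import shutil
import tempfile


def download_images(url: str, output_path: str) -> None:
    """Stand-in for the download step (hypothetical body).

    Creates a directory at output_path and puts files into it. No network
    access happens here; placeholder bytes derived from url are written instead.
    """
    os.makedirs(output_path, exist_ok=True)
    for index in range(3):
        file_name = os.path.join(output_path, f"image_{index}.bin")
        with open(file_name, "wb") as image_file:
            image_file.write(f"{url}#{index}".encode("utf-8") * 100)


def compress_images(input_path: str, output_path: str) -> None:
    """Reads every file from input_path and writes a gzipped copy to output_path."""
    os.makedirs(output_path, exist_ok=True)
    for name in sorted(os.listdir(input_path)):
        source = os.path.join(input_path, name)
        target = os.path.join(output_path, name + ".gz")
        with open(source, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)


def run_locally(url: str) -> list:
    """Chains the two steps using temporary directories (hypothetical helper).

    In a pipeline, KFP supplies the paths and moves the data between steps;
    here the directories are simply created by hand.
    """
    with tempfile.TemporaryDirectory() as work_dir:
        images_dir = os.path.join(work_dir, "images")
        compressed_dir = os.path.join(work_dir, "compressed")
        download_images(url, images_dir)
        compress_images(images_dir, compressed_dir)
        return sorted(os.listdir(compressed_dir))
```

    The point of the sketch is that each function only ever touches the paths it is given. It never needs to know about a shared volume, which is what lets the system handle moving the data from one step to the next.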
    

    You can also find many examples of real-world components in this repo: https://github.com/Ark-kun/pipeline_components/tree/master/components

    P.S. As a small team, we've spent a lot of time answering user questions about volumes not working, even though the official documentation and all samples and tutorials show the proper methods and never suggest using volumes. I want to understand where this comes from. Is there some unofficial KFP tutorial on the Internet that teaches users to pass data via volumes?