Tags: python, dataframe, amazon-s3, kubeflow, kubeflow-pipelines

What is the correct way to share data frames between components?


I am working on a legacy Kubeflow project. The pipelines have a few components that apply filters to a data frame.

To do this, each component downloads the data frame from S3, applies its filter, and uploads the result to S3 again.

The components that use the data frame for training or validating the models also download it from S3.

My question: is this a best practice, or is it better to share the data frame directly between components? The upload to S3 can fail, and that failure then fails the whole pipeline.
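For concreteness, the current pattern looks roughly like the sketch below; the bucket, object keys, temporary paths, and the filter condition are hypothetical placeholders, not taken from the actual project.

```python
import boto3
import pandas as pd

BUCKET = "my-pipeline-bucket"  # hypothetical bucket name


def filter_component(input_key: str, output_key: str) -> None:
    """Download the data frame from S3, filter it, and upload the result."""
    s3 = boto3.client("s3")

    # Download the current version of the data frame.
    s3.download_file(BUCKET, input_key, "/tmp/input.parquet")
    df = pd.read_parquet("/tmp/input.parquet")

    # Apply the filter (placeholder condition).
    df = df[df["value"] > 0]

    # Upload the filtered result; if this call fails, the whole step fails.
    df.to_parquet("/tmp/output.parquet")
    s3.upload_file("/tmp/output.parquet", BUCKET, output_key)
```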

Thanks


Solution

  • As always with questions asking for "best" or "recommended" method, the primary answer is: "it depends".

    However, there are certain considerations worth spelling out in your case.

    1. Saving to S3 in between pipeline steps. This persists the intermediate results of the pipeline, and as long as the steps take a long time and are restartable, it may be worth doing. What "a long time" means depends on your use case, though.

    2. Passing the data directly from component to component. This saves you storage throughput and very likely the not-insignificant time needed to store and retrieve the data to / from S3. The downside: if the pipeline fails mid-way, you have to start from scratch. (A sketch of this approach follows at the end of this answer.)

    So the questions are:

    • Are the steps idempotent (restartable)?
    • How often does the pipeline fail?
    • Is it easy to restart the processing from some mid-point?
    • Do you care about the processing time more than the risk of losing some work?
    • Do you care about the incurred cost of S3 storage/transfer?
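Below is a minimal sketch of option 2 using the KFP v2 SDK. The component names (load_data, filter_rows, train_model), the placeholder filter, and the choice of parquet are assumptions for illustration, not from the original project. Note that Dataset artifacts passed this way are still persisted to the configured pipeline root (typically an object store such as S3 or MinIO), so the upload/download plumbing is handled by KFP rather than by hand-written S3 calls in every component.

```python
from kfp import dsl
from kfp.dsl import Dataset, Input, Output


@dsl.component(packages_to_install=["pandas", "pyarrow", "s3fs"])
def load_data(source_uri: str, raw_data: Output[Dataset]):
    import pandas as pd

    # s3fs lets pandas read s3:// URIs directly; any readable URI works here.
    df = pd.read_parquet(source_uri)
    df.to_parquet(raw_data.path)


@dsl.component(packages_to_install=["pandas", "pyarrow"])
def filter_rows(raw_data: Input[Dataset], filtered_data: Output[Dataset]):
    import pandas as pd

    df = pd.read_parquet(raw_data.path)
    df = df[df["value"] > 0]  # placeholder filter
    df.to_parquet(filtered_data.path)


@dsl.component(packages_to_install=["pandas", "pyarrow"])
def train_model(training_data: Input[Dataset]):
    import pandas as pd

    df = pd.read_parquet(training_data.path)
    # ... fit and log a model on df here ...


@dsl.pipeline(name="filter-and-train")
def filter_and_train(source_uri: str):
    loaded = load_data(source_uri=source_uri)
    filtered = filter_rows(raw_data=loaded.outputs["raw_data"])
    train_model(training_data=filtered.outputs["filtered_data"])
```

With this layout each component only reads and writes its local artifact path; where those artifacts physically live (S3, MinIO, GCS) is decided once by the pipeline root configuration rather than in every component.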