I am working with a legacy Kubeflow project. The pipelines have a few components that each apply some kind of filter to a data frame.
To do this, each component downloads the data frame from S3, applies its filter, and uploads the result back to S3.
The components that use the data frame for training or validating the models download it from S3 as well.
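Each filter component is roughly shaped like this (a simplified sketch, not the actual code; the bucket, keys, and filter condition are placeholders):

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def filter_component(bucket: str, input_key: str, output_key: str) -> None:
    # Download the intermediate data frame from S3.
    obj = s3.get_object(Bucket=bucket, Key=input_key)
    df = pd.read_parquet(io.BytesIO(obj["Body"].read()))

    # Apply this component's filter (placeholder condition).
    df = df[df["value"].notna()]

    # Upload the filtered data frame back to S3 for the next component.
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)
    s3.put_object(Bucket=bucket, Key=output_key, Body=buffer.getvalue())
```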
My question is whether this is a best practice, or whether it is better to share the data frame directly between components, since the upload to S3 can fail and then fail the whole pipeline.
Thanks
As always with questions asking for the "best" or "recommended" method, the primary answer is: "it depends".
However, there are certain considerations worth spelling out in your case.
Saving to S3 in between pipeline steps. This stores the pipeline's intermediate results, and as long as the steps take a long time and are restartable, it may be worth doing. What "a long time" means depends on your use case, though.
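If you keep the S3-in-between approach, two things help with your failure concern: let the client retry transient upload failures, and make each step skip work whose output already exists, so a rerun resumes instead of starting over. A rough sketch, assuming boto3 and placeholder bucket/key names:

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Retry transient S3 failures inside the client instead of failing the step outright.
s3 = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "standard"}))

def output_already_exists(bucket: str, key: str) -> bool:
    """Return True if a previous run already stored this intermediate result."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise

def run_filter_step(bucket: str, input_key: str, output_key: str) -> None:
    if output_already_exists(bucket, output_key):
        return  # restartable: this step's work is already in S3, skip it
    # ... download input_key, apply the filter, upload output_key ...
```

Depending on your KFP version, you can usually also configure retries on the task itself, which addresses the same concern at the pipeline level.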
Passing the data directly from component to component. This saves you storage throughput and, very likely, the not-insignificant time it takes to store and retrieve the data to/from S3. The downside: if you fail midway through the pipeline, you have to start from scratch.
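One way to pass the data "directly" is to collapse the filter steps into a single component so the data frame stays in memory between filters. A rough sketch with placeholder filter functions:

```python
import io

import boto3
import pandas as pd

def filter_a(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["value"].notna()]   # placeholder filter

def filter_b(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["value"] > 0]       # placeholder filter

def combined_filter_component(bucket: str, input_key: str, output_key: str) -> None:
    s3 = boto3.client("s3")

    # One download, every filter applied in memory, one upload.
    obj = s3.get_object(Bucket=bucket, Key=input_key)
    df = pd.read_parquet(io.BytesIO(obj["Body"].read()))

    for apply_filter in (filter_a, filter_b):
        df = apply_filter(df)

    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)
    s3.put_object(Bucket=bucket, Key=output_key, Body=buffer.getvalue())
    # Trade-off: if anything fails before the final upload, there is no stored
    # intermediate result to resume from, so the whole chain reruns.
```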
So the questions are: