Hello, I am training a YOLO model in a Kubeflow pipeline. To do this, I have a set of pictures totalling more than 1GB.
Currently, I download all the images from MinIO to the container with a script, and after that I train the model.
I am not sure whether there is a best practice for this, because downloading 1GB for each training run is a lot.
Is there another way to do this that avoids writing MinIO scripts to download the picture dataset? Can I use a shared volume or something similar to share files between operators? (The idea is to train another model with the same dataset.)
We advise using KFP's built-in data passing methods. This way you get reproducibility, immutability, caching, etc.
You should split your pipeline into multiple components:
Download->Preprocess->Train
This way, the outputs of the Download task are cached, and as long as its inputs do not change it is never executed again. The same goes for the Preprocess task.
On the concern that downloading 1GB for each training run is a lot:
Kubernetes volumes are attached over the network anyway. Moving data from one machine to another is "downloading", however it is done. What you want to do with volumes is actually slower. When you train for 100 epochs with KFP data passing, the data is downloaded or mounted only once. With a shared volume, the data will be downloaded 100 times.