
What's the most efficient way of loading data for training?


I currently use Vertex AI Custom training where:

  1. Custom training (PyTorch) with the dataset in GCS
  2. Every time Vertex AI launches a training job, it clones and shards my data into a staging bucket
  3. My training job streams the data from the staging bucket into my training application using TorchData (see the example at https://pytorch.org/data/beta/dp_tutorial.html#accessing-google-cloud-storage-gcs-with-fsspec-datapipes; a minimal sketch follows this list)
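
For concreteness, here is a minimal sketch of the streaming setup in step 3, following the fsspec-datapipes pattern from the linked tutorial. The bucket path is hypothetical, and gcsfs must be installed:

```python
from torchdata.datapipes.iter import IterableWrapper

# Hypothetical staging-bucket prefix; needs `pip install torchdata gcsfs`.
STAGING = "gs://my-staging-bucket/train"

dp = (
    IterableWrapper([STAGING])
    .list_files_by_fsspec()           # enumerate objects under the prefix via gcsfs
    .open_files_by_fsspec(mode="rb")  # open each object lazily as a byte stream
)

# Each element is a (path, stream) pair; every read() goes over the network,
# which is where the I/O stalls can come from.
for path, stream in dp:
    payload = stream.read()
```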

However, when doing so I notice bouts of 0% utilisation on my GPU (while my GPU memory stays constant at ~80%). I presume that's due to an I/O bottleneck, since the data is being piped from a remote GCS bucket.

What's the most efficient way of loading data into my training application? Would it be to download the data into my training container and then load it locally, rather than piping it from a GCS bucket?

[Screenshot: bouts of 0% GPU utilisation]
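
For reference, the "download first" alternative could be a one-off copy at container start-up, e.g. with the google-cloud-storage client. The bucket, prefix, and local directory below are hypothetical; `gsutil -m cp -r` achieves the same thing:

```python
import os
from google.cloud import storage

# Hypothetical names: the staged bucket/prefix and a local scratch directory.
BUCKET, PREFIX, LOCAL_DIR = "my-staging-bucket", "train/", "/tmp/data"

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    if blob.name.endswith("/"):  # skip "directory" placeholder objects
        continue
    dest = os.path.join(LOCAL_DIR, blob.name)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    blob.download_to_filename(dest)

# The DataLoader then reads from LOCAL_DIR at local-disk speed instead of
# fetching each sample over the network.
```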


Solution

  • I found this blog post from GCP that answers the question:

    https://cloud.google.com/blog/products/ai-machine-learning/efficient-pytorch-training-with-vertex-ai

    TL;DR - use torchdata.datapipes.iter.WebDataset.
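
    A minimal sketch of that pattern, assuming the dataset has already been packed into tar shards at a hypothetical GCS path (torchdata and gcsfs installed):

    ```python
    from torchdata.datapipes.iter import IterableWrapper

    # Hypothetical shard names; samples are packed into tar archives ahead of time.
    SHARDS = [f"gs://my-staging-bucket/shards/train-{i:03d}.tar" for i in range(8)]

    dp = (
        IterableWrapper(SHARDS)
        .shuffle()                        # shuffle shard order each epoch
        .open_files_by_fsspec(mode="rb")  # stream whole tars sequentially from GCS
        .load_from_tar()                  # yields (member_path, stream) per tar entry
        .webdataset()                     # group entries sharing a key into one sample dict
    )

    # Each sample is a dict keyed by extension, e.g. {"__key__": ..., ".jpg": ..., ".cls": ...}
    for sample in dp:
        image_bytes = sample[".jpg"].read()
    ```

    Reading a few large sequential tar shards replaces many small random GETs with big streaming reads, which is what keeps the GPU fed.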