Tags: google-cloud-platform, tpu, gcp-ai-platform-training

Preemptible TPU - Data Management while persistent disks are not available


I have access to one preemptible Cloud TPU v3-32 and want to train my LM on it. However, since it is preemptible, I can't attach a persistent disk to it in read-write mode, as is also mentioned in the docs.
My dataset is around 100 GB. Here is what I tried, but neither approach worked:

  1. Preprocessed and cached the data on another VM, saved it to a persistent disk, then attached the PD to the TPU in read-only mode: I get a write-permission error whenever my code tries to acquire the lock file.

  2. Used Google Cloud Storage buckets and TFDS to stream the data: the problem here is caching; the cache needs about 250 GB of space, which is not available.

I am using JAX/Flax; the script is available here: SCRIPT


Solution

  • A TPU v3-32 has 4 hosts (each with 8 TPU cores attached), each with 340 GB of DRAM and about 100 GB of disk storage. So if you shard your dataset 4 ways, you can store one shard on each host.

    But I recommend storing your dataset in a GCS bucket and using distributed tf.data (or another option) to process, map, prefetch, and batch in parallel on each host (each host needs to process 1/4 of the dataset per epoch); see the sketch after the links below.

    https://www.tensorflow.org/guide/data_performance

    https://github.com/google/seqio is another option to consider.
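
    For concreteness, here is a minimal sketch of that per-host pipeline, keyed on JAX's process index so each of the 4 hosts streams a disjoint quarter of the files straight from GCS with no `.cache()`, avoiding the 250 GB local footprint. The bucket path, TFRecord format, feature name, and sequence length are assumptions for illustration, not taken from your script:

    ```python
    import jax
    import tensorflow as tf

    # Hypothetical path and schema -- adjust to your dataset.
    GCS_PATTERN = "gs://my-bucket/lm-data/train-*.tfrecord"

    def parse_example(record):
        # Hypothetical schema: one fixed-length token sequence per record.
        features = {"input_ids": tf.io.FixedLenFeature([512], tf.int64)}
        return tf.io.parse_single_example(record, features)

    def build_dataset(batch_size_per_host):
        # Fixed seed so every host sees the same file order,
        # which keeps the shards below disjoint.
        files = tf.data.Dataset.list_files(GCS_PATTERN, shuffle=True, seed=0)
        # Give each host a disjoint 1/4 of the files; on a v3-32,
        # jax.process_count() == 4 and jax.process_index() is 0..3.
        files = files.shard(num_shards=jax.process_count(),
                            index=jax.process_index())
        ds = files.interleave(tf.data.TFRecordDataset,
                              num_parallel_calls=tf.data.AUTOTUNE,
                              deterministic=False)
        ds = ds.shuffle(10_000)
        ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
        ds = ds.batch(batch_size_per_host, drop_remainder=True)
        # Stream and prefetch from GCS instead of caching locally.
        return ds.prefetch(tf.data.AUTOTUNE)
    ```

    Run the same script on all 4 hosts; each builds its own quarter of the input pipeline, so no persistent disk or local cache is needed.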