Tags: google-cloud-platform, tpu, gcp-ai-platform-training

Preemptible TPU - Data Management while persistent disks are not available


I have access to one preemptible Cloud TPU v3-32 and want to train my LM on it. However, since it is preemptible, I can't attach a persistent disk to it in read-write mode, as is also mentioned in the docs.
My dataset is around 100 GB. Here is what I tried, but neither approach worked:

  1. Preprocessed and cached the data on another VM, saved it to a persistent disk, then attached the PD to the TPU in read-only mode: I get a write-permission error whenever my code tries to acquire the lock file.

  2. Used Google Cloud Storage buckets and TFDS to stream the data: the problem here is caching; the cache needs about 250 GB of space, which is not available.

I am using JAX/Flax; the script is available here: SCRIPT


Solution

  • A TPU v3-32 has 4 hosts (each with 8 TPU cores attached), each with 340 GB of DRAM and about 100 GB of disk storage. So if you shard your dataset 4 ways, you can store one shard on each host.

    But I recommend storing your dataset in a GCS bucket and using distributed tf.data (or another option) to process, map, prefetch, and batch in parallel on each host (each host needs to process 1/4 of the dataset per epoch); see the sketch after the links below.

    https://www.tensorflow.org/guide/data_performance

    https://github.com/google/seqio is another option to consider.
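
    For concreteness, here is a minimal sketch of that per-host pipeline, keyed on JAX's process index so each of the 4 hosts streams a disjoint quarter of the files straight from GCS with no `.cache()`, avoiding the 250 GB local footprint. The bucket path, TFRecord format, feature name, and sequence length are assumptions for illustration, not taken from your script:

    ```python
    import jax
    import tensorflow as tf

    # Hypothetical path and schema -- adjust to your dataset.
    GCS_PATTERN = "gs://my-bucket/lm-data/train-*.tfrecord"

    def parse_example(record):
        # Hypothetical schema: one fixed-length token sequence per record.
        features = {"input_ids": tf.io.FixedLenFeature([512], tf.int64)}
        return tf.io.parse_single_example(record, features)

    def build_dataset(batch_size_per_host):
        # Fixed seed so every host sees the same file order,
        # which keeps the shards below disjoint.
        files = tf.data.Dataset.list_files(GCS_PATTERN, shuffle=True, seed=0)
        # Give each host a disjoint 1/4 of the files; on a v3-32,
        # jax.process_count() == 4 and jax.process_index() is 0..3.
        files = files.shard(num_shards=jax.process_count(),
                            index=jax.process_index())
        ds = files.interleave(tf.data.TFRecordDataset,
                              num_parallel_calls=tf.data.AUTOTUNE,
                              deterministic=False)
        ds = ds.shuffle(10_000)
        ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
        ds = ds.batch(batch_size_per_host, drop_remainder=True)
        # Stream and prefetch from GCS instead of caching locally.
        return ds.prefetch(tf.data.AUTOTUNE)
    ```

    Run the same script on all 4 hosts; each builds its own quarter of the input pipeline, so no persistent disk or local cache is needed.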