tensorflow2.0 tpu google-cloud-tpu gcp-ai-platform-training google-ai-platform

How to effectively use the TFRC program with GCP AI Platform jobs


I'm trying to run a hyperparameter tuning job on GCP's AI Platform job service. The TensorFlow Research Cloud (TFRC) program approved the following for me:

  • 100 preemptible Cloud TPU v2-8 device(s) in zone us-central1-f
  • 20 on-demand Cloud TPU v2-8 device(s) in zone us-central1-f
  • 5 on-demand Cloud TPU v3-8 device(s) in zone europe-west4-a

I have already built a custom model in TensorFlow 2, and I want to run the job in a specific zone so that I can use the TFRC quota together with the AI Platform job service. Right now I have a YAML config file that looks like this:

trainingInput:
  scaleTier: basic-tpu
  region: us-central1
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: val_accuracy
    maxTrials: 100
    maxParallelTrials: 16
    maxFailedTrials: 30
    enableTrialEarlyStopping: True

In theory, running 16 parallel trials, each on a separate TPU instance, should work. Instead, the submission returns an error because the request exceeds the TPU_V2 quota:

ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED: Quota failure for project ###################. The request for 128 TPU_V2 accelerators for 16 parallel runs exceeds the allowed maximum of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 30 K80, 30 P100, 6 T4 accelerators.

I then reduced maxParallelTrials to 2 and the job worked. Together with the error message above (128 TPU_V2 accelerators for 16 parallel runs, i.e. 8 per trial), this confirms that the quota is counted per TPU core, not per TPU instance.
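The arithmetic behind the quota error can be checked with a minimal sketch. The figures are copied from the error message; the function names are hypothetical, and the value of 8 accelerator units per trial is inferred from "128 TPU_V2 accelerators for 16 parallel runs" rather than taken from any API.

```python
# Figures copied from the RESOURCE_EXHAUSTED message in the question.
REQUESTED_UNITS = 128      # "The request for 128 TPU_V2 accelerators"
PARALLEL_RUNS = 16         # "for 16 parallel runs"
TPU_V2_QUOTA = 16          # "allowed maximum of ... 16 TPU_V2"

# Inferred, not documented here: units consumed by one trial.
UNITS_PER_TRIAL = REQUESTED_UNITS // PARALLEL_RUNS


def requested_units(parallel_trials, units_per_trial=UNITS_PER_TRIAL):
    """Accelerator units a tuning job asks for at a given parallelism."""
    return parallel_trials * units_per_trial


def max_parallel_trials(quota, units_per_trial=UNITS_PER_TRIAL):
    """Largest maxParallelTrials value that still fits inside the quota."""
    return quota // units_per_trial


print(requested_units(16))                # units requested with maxParallelTrials: 16
print(max_parallel_trials(TPU_V2_QUOTA))  # largest parallelism that fits the quota
```

Under these assumptions, a quota of 16 units leaves room for only 2 parallel trials, which matches the observation that the job was accepted with maxParallelTrials set to 2.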

So I thought I might have completely misunderstood the quota approved by the TFRC program. I then checked whether the job was using the us-central1-f zone, and it turns out it is running in a zone I did not ask for:

-tpu_node={"project": "p091c8a0a31894754-tp", "zone": "us-central1-c", "tpu_node_name": "cmle-training-1597710560117985038-tpu"}"
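As a rough sketch of the check described here, the JSON value of the tpu_node flag can be parsed and its zone compared against the zones listed in the TFRC approval. The helper name and the TFRC_ZONES set are hypothetical; the JSON string is copied from the log line, without the flag name and the trailing quote.

```python
import json

# Zones taken from the TFRC approval listed in the question.
TFRC_ZONES = {"us-central1-f", "europe-west4-a"}


def tpu_zone(tpu_node_json):
    """Return the "zone" field from a tpu_node JSON value."""
    return json.loads(tpu_node_json)["zone"]


# JSON payload copied from the job log.
node = (
    '{"project": "p091c8a0a31894754-tp", '
    '"zone": "us-central1-c", '
    '"tpu_node_name": "cmle-training-1597710560117985038-tpu"}'
)

zone = tpu_zone(node)
print(zone, "covered by TFRC approval:", zone in TFRC_ZONES)
```

For this log line, the zone is us-central1-c, which is not one of the approved TFRC zones.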

This behavior prevents me from making effective use of the free approved quota. If I understand correctly, the job running in us-central1-c is charged to my account and does not use the free resources. Is there a way to set the zone for an AI Platform job, and is it possible to pass a flag to use preemptible TPUs?


Solution

  • Unfortunately, the two (the TFRC quota and AI Platform training jobs) can't be combined.