Tags: google-cloud-platform, gpu, quota, gcp-ai-platform-training

GCP AI platform training cannot use full GPU quota


On the GCP -> IAM & admin -> Quotas page, the service "Compute Engine API NVIDIA V100 GPUs" for us-central1 shows a limit of 4. But when I submit a training job on GCP AI Platform using the command below, I get an error saying the maximum allowed number of V100 GPUs is 2.
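
For reference, the same regional quota can also be read with the Cloud SDK. This is only a minimal sketch (not part of the original screenshot), assuming gcloud is installed and $PROJECT_ID is set to my project ID:

# Print the Compute Engine quota list for us-central1; the NVIDIA_V100_GPUS
# entry should show the limit of 4 seen in the console.
gcloud compute regions describe us-central1 \
    --project $PROJECT_ID \
    --format "yaml(quotas)"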

Here is the command:

gcloud beta ai-platform jobs submit training $JOB_NAME \
    --staging-bucket $PACKAGE_STAGING_PATH \
    --job-dir $JOB_DIR  \
    --package-path $TRAINER_PACKAGE_PATH \
    --module-name $MAIN_TRAINER_MODULE \
    --python-version 3.5 \
    --region us-central1 \
    --runtime-version 1.14 \
    --scale-tier custom \
    --master-machine-type n1-standard-8 \
    --master-accelerator count=4,type=nvidia-tesla-v100 \
    -- \
    --data_dir=$DATA_DIR \
    --initial_epoch=$INITIAL_EPOCH \
    --num_epochs=$NUM_EPOCHS

Here is the error message:

ERROR: (gcloud.beta.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED: Quota failure for project [PROJECT_ID]. The request for 4 V100 accelerators exceeds the allowed maximum of 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 40 K80, 40 P100, 8 T4. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.
- '@type': type.googleapis.com/google.rpc.QuotaFailure
  violations:
  - description: The request for 4 V100 accelerators exceeds the allowed maximum of
      16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 40 K80, 40 P100, 8 T4.
    subject: [PROJECT_ID]

The GPUs on Compute Engine page says that up to 8 NVIDIA® Tesla® V100 GPUs are available in zones us-central1-a, us-central1-b, us-central1-c, and us-central1-f. My default zone is us-central1-c.
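
Zone availability can be cross-checked from the command line as well (again just a sketch, assuming the Cloud SDK is installed):

# List the accelerator types offered in the us-central1 zones, to confirm
# that nvidia-tesla-v100 appears in my default zone (us-central1-c).
gcloud compute accelerator-types list --filter "zone:us-central1"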

What should I do to use all 4 V100 GPUs for the training? Thanks.

UPDATE 1 (1/14/2020): This page says that the global GPU quota may need to be increased to match the per-region quota, but I couldn't find that global quota anywhere on the Quotas page.

To protect Compute Engine systems and users, new projects have a global GPU quota, which limits the total number of GPUs you can create in any supported zone. When you request a GPU quota, you must request a quota for the GPU models that you want to create in each region, and an additional global quota for the total number of GPUs of all types in all zones.
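
For completeness, the global quota can also be checked from the CLI; it appears to be a project-wide quota (metric GPUS_ALL_REGIONS) rather than a per-region entry, which may be why it does not show up next to the regional rows. A sketch, assuming $PROJECT_ID is set:

# Project-wide quotas, including the global GPU quota (metric GPUS_ALL_REGIONS),
# live on the project resource rather than on an individual region.
gcloud compute project-info describe \
    --project $PROJECT_ID \
    --format json | grep -B 1 -A 1 GPUS_ALL_REGIONS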

UPDATE 2 (1/14/2020): I contacted GCP to increase the global GPU quota to match my regional quota. They replied that some projects need this, but mine does not.


Solution

  • Google support told me: "there is a V100 GPUS quota, and a V100 VWS GPUS quota. The VWS quota in your project is only 1. Not sure which one is needed here, but that might have been the root cause." After they adjusted the quota, I can now attach up to 8 V100 GPUs to training jobs (see the check sketched below).
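
This check is only a sketch; the exact Compute Engine metric names (e.g. NVIDIA_V100_GPUS and a separate VWS/virtual-workstation variant) are assumptions and may differ per project:

# Narrow the regional quota list down to the V100-related entries, so the plain
# V100 quota and the V100 VWS quota (if present) can be compared after the
# support-side adjustment.
gcloud compute regions describe us-central1 \
    --project $PROJECT_ID \
    --format json | grep -B 1 -A 1 V100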