Tags: docker, google-cloud-platform, gcloud, google-cloud-sdk, google-cloud-vertex-ai

How to pass environment variables to gcloud beta ai custom-jobs create with custom container (Vertex AI)


I'm running custom training jobs in Google's Vertex AI. A simple gcloud command to execute a custom job would use syntax like the following (complete documentation for the command can be seen here):

gcloud beta ai custom-jobs create --region=us-central1 \
--display-name=test \
--config=config.yaml

In the config.yaml file, it is possible to specify the machine and accelerator (GPU) types, etc., and, in my case, to point to a custom container living in Google Artifact Registry that executes the training code (specified in the imageUri field of the containerSpec). An example config file may look like this:

# config.yaml
workerPoolSpecs:
  machineSpec:
    machineType: n1-highmem-2
    acceleratorType: NVIDIA_TESLA_P100
    acceleratorCount: 2
  replicaCount: 1
  containerSpec:
    imageUri: {URI_FOR_CUSTOM_CONTAINER}
    args:
    - {ARGS TO PASS TO CONTAINER ENTRYPOINT COMMAND}

The code we're running needs some runtime environment variables (which must be kept secure) passed to the container. The API documentation for the containerSpec says it is possible to set environment variables as follows:

# config.yaml
workerPoolSpecs:
  machineSpec:
    machineType: n1-highmem-2
    acceleratorType: NVIDIA_TESLA_P100
    acceleratorCount: 2
  replicaCount: 1
  containerSpec:
    imageUri: {URI_FOR_CUSTOM_CONTAINER}
    args:
    - {ARGS TO PASS TO CONTAINER ENTRYPOINT COMMAND}
    env:
    - name: SECRET_ONE
      value: $SECRET_ONE
    - name: SECRET_TWO
      value: $SECRET_TWO

When I try to add the env field to the containerSpec, I get an error saying it's not part of the container spec:

ERROR: (gcloud.beta.ai.custom-jobs.create) INVALID_ARGUMENT: Invalid JSON payload received. Unknown name "env" at 'custom_job.job_spec.worker_pool_specs[0].container_spec': Cannot find field.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: "Invalid JSON payload received. Unknown name \"env\" at 'custom_job.job_spec.worker_pool_specs[0].container_spec':\
      \ Cannot find field."
    field: custom_job.job_spec.worker_pool_specs[0].container_spec

Any idea how to securely set runtime environment variables in Vertex AI custom jobs using custom containers?


Solution

  • There are two versions of the REST API, "v1" and "v1beta1"; "v1beta1" does not have the env field in ContainerSpec, but "v1" does. The gcloud ai custom-jobs create command without the beta component doesn't throw the error because it uses version "v1" to make the API calls.
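
    For example, the job from the question can be submitted with the v1 surface simply by dropping beta from the command:

    gcloud ai custom-jobs create --region=us-central1 \
    --display-name=test \
    --config=config.yaml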

    The environment variables from the yaml file can be passed to the custom container in the following way:

    This is the Dockerfile of the sample custom training application I used to test the requirement. Please refer to this codelab for more information about the training application.

    FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-3
    WORKDIR /
    
    # Copies the trainer code to the docker image.
    COPY trainer /trainer
    
    # Copies the bash script to the docker image.
    COPY commands.sh /scripts/commands.sh
    
    # Makes the script file executable.
    RUN ["chmod", "+x", "/scripts/commands.sh"]
    
    # Executes the script when the container starts.
    ENTRYPOINT ["/scripts/commands.sh"]
    
    # If you are not using a bash script, the trainer can be invoked directly
    # from the docker ENTRYPOINT to use the environment variable, e.g.
    # ENTRYPOINT "python" "-m" $SECRET_TWO
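
    A note on the alternative mentioned in the comment above: Docker only expands environment variables in the shell form of ENTRYPOINT. A minimal sketch of the two forms (my illustration, not from the original answer):

    # Shell form: Docker wraps this in /bin/sh -c, so $SECRET_TWO is
    # expanded when the container starts.
    ENTRYPOINT python -m $SECRET_TWO
    
    # The exec form does not expand variables by itself; to get expansion,
    # a shell has to be invoked explicitly:
    # ENTRYPOINT ["/bin/sh", "-c", "python -m $SECRET_TWO"]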
    

    Below is the commands.sh file used in the docker container to test whether the environment variables are passed to the container.

    #!/bin/bash
    mkdir /root/.ssh
    # Echo the first environment variable to verify it reached the container.
    echo "$SECRET_ONE"
    # Use the second environment variable as the Python module to run.
    python -m "$SECRET_TWO"
    

    The example config.yaml file:

    # config.yaml
    workerPoolSpecs:
      machineSpec:
        machineType: n1-highmem-2
      replicaCount: 1
      containerSpec:
        imageUri: gcr.io/infosys-kabilan/mpg:v1
        env:
        - name: SECRET_ONE
          value: "Passing the environment variables"
        - name: SECRET_TWO
          value: "trainer.train"
    

    As the next step, I built and pushed the container to Google Container Registry; a sketch of this step follows below. Now, running gcloud ai custom-jobs create --region=us-central1 --display-name=test --config=config.yaml (note: without beta) creates the custom training job, and the output of the commands.sh file can be seen in the job logs, as shown in the screenshot below.
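
    The build-and-push step might look like the following (the image path matches the config above; adjust it for your own project and authenticate docker to the registry first):

    # Build the training image and push it to Container Registry.
    docker build -t gcr.io/infosys-kabilan/mpg:v1 .
    docker push gcr.io/infosys-kabilan/mpg:v1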

    [Screenshot: the custom job's logs showing the output of commands.sh, confirming the environment variables were passed to the container.]