I'm running custom training jobs in Google's Vertex AI. A simple gcloud command to execute a custom job would use something like the following syntax (complete documentation for the command can be seen here):
gcloud beta ai custom-jobs create --region=us-central1 \
--display-name=test \
--config=config.yaml
In the config.yaml file, it is possible to specify the machine and accelerator (GPU) types, etc., and in my case, point to a custom container living in the Google Artifact Registry that executes the training code (specified in the imageUri part of the containerSpec). An example config file may look like this:
# config.yaml
workerPoolSpecs:
  machineSpec:
    machineType: n1-highmem-2
    acceleratorType: NVIDIA_TESLA_P100
    acceleratorCount: 2
  replicaCount: 1
  containerSpec:
    imageUri: {URI_FOR_CUSTOM_CONTAINER}
    args:
    - {ARGS TO PASS TO CONTAINER ENTRYPOINT COMMAND}
The code we're running needs some runtime environment variables (that need to be secure) passed to the container. In the API documentation for the containerSpec, it says it is possible to set environment variables as follows:
# config.yaml
workerPoolSpecs:
  machineSpec:
    machineType: n1-highmem-2
    acceleratorType: NVIDIA_TESLA_P100
    acceleratorCount: 2
  replicaCount: 1
  containerSpec:
    imageUri: {URI_FOR_CUSTOM_CONTAINER}
    args:
    - {ARGS TO PASS TO CONTAINER ENTRYPOINT COMMAND}
    env:
    - name: SECRET_ONE
      value: $SECRET_ONE
    - name: SECRET_TWO
      value: $SECRET_TWO
When I try to add the env field to the containerSpec, I get an error saying it's not part of the container spec:
ERROR: (gcloud.beta.ai.custom-jobs.create) INVALID_ARGUMENT: Invalid JSON payload received. Unknown name "env" at 'custom_job.job_spec.worker_pool_specs[0].container_spec': Cannot find field.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: "Invalid JSON payload received. Unknown name \"env\" at 'custom_job.job_spec.worker_pool_specs[0].container_spec': Cannot find field."
    field: custom_job.job_spec.worker_pool_specs[0].container_spec
Any idea how to securely set runtime environment variables in Vertex AI custom jobs using custom containers?
There are two versions of the REST API, "v1" and "v1beta1". "v1beta1" does not have the env option in ContainerSpec, but "v1" does. The gcloud ai custom-jobs create command without the beta component doesn't throw the error, because it makes its API calls against "v1".
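For example, the same job submits cleanly once beta is dropped from the original command:

gcloud ai custom-jobs create --region=us-central1 \
--display-name=test \
--config=config.yaml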
The environment variables from the YAML file can be passed to the custom container in the following way. This is the Dockerfile of the sample custom training application I used to test the requirement. Please refer to this codelab for more information about the training application.
FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-3
WORKDIR /root
WORKDIR /
# Copies the trainer code to the docker image.
COPY trainer /trainer
# Copies the bash script to the docker image.
COPY commands.sh /scripts/commands.sh
# Bash command to make the script file an executable
RUN ["chmod", "+x", "/scripts/commands.sh"]
# Command to execute the file
ENTRYPOINT ["/scripts/commands.sh"]
# Sets up the entry point to invoke the trainer.
# ENTRYPOINT "python" "-m" $SECRET_TWO ⇒ To use the environment variable
# directly in the docker ENTRYPOINT. In case you are not using a bash script,
# the trainer can be invoked directly from the docker ENTRYPOINT.
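If you are not using a bash script, a minimal sketch of that alternative is a shell-form ENTRYPOINT, so /bin/sh expands the variable when the container starts (rather than at build time):

# Shell-form ENTRYPOINT: $SECRET_TWO is resolved at container runtime
ENTRYPOINT python -m $SECRET_TWO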
Below is the commands.sh file used in the Docker container to test whether the environment variables are passed to the container.
#!/bin/bash
mkdir /root/.ssh
echo $SECRET_ONE
python -m $SECRET_TWO
The example config.yaml file:
# config.yaml
workerPoolSpecs:
  machineSpec:
    machineType: n1-highmem-2
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/infosys-kabilan/mpg:v1
    env:
    - name: SECRET_ONE
      value: "Passing the environment variables"
    - name: SECRET_TWO
      value: "trainer.train"
As the next step, I built and pushed the container to Google Container Registry. Now, gcloud ai custom-jobs create --region=us-central1 --display-name=test --config=config.yaml can be run to create the custom training job, and the output of the commands.sh file can be seen in the job logs.
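Assuming the job ID printed by the create command, the logs (including the echoed value of SECRET_ONE) can be followed with something like:

# CUSTOM_JOB_ID is the numeric ID returned by the create command
gcloud ai custom-jobs stream-logs CUSTOM_JOB_ID --region=us-central1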