Vertex AI custom job to run python-module with pre-built containers (using gcloud CLI)


I am migrating a model that previously ran on GCP AI Platform to Vertex AI [1, 2].

The setup I am aiming for is as follows.

  • Vertex AI custom job with pre-built containers (using gcloud CLI)
  • to run a custom python-module which contains the code of the training phase of our model

Can someone tell me whether anything is wrong with the sequence of steps below?

The Python module itself does not seem to be the cause, since the same code currently runs fine on AI Platform.

Python3 module packaging

# simplified python module structure
# ./vertex-ai-poc
# ├── __init__.py
# ├── trainer
# │   ├── __init__.py
# │   └── task.py
# └── setup.py

python3 ./[PATH]/vertex-ai-poc/setup.py sdist --formats=gztar
# -> dist generated

gsutil cp dist/trainer-0.2.tar.gz gs://[PROJECT_ID]/vertex-ai-poc/trainer-0.2.tar.gz
# -> uploaded correctly

Submit Custom Job

gcloud ai custom-jobs create \
    --region us-central1 \
    --display-name=vertex-ai-poc \
    --project=[PROJECT_ID] \
    --python-package-uris='gs://[PROJECT_ID]/vertex-ai-poc/trainer-0.2.tar.gz' \
    --worker-pool-spec=machine-type=e2-standard-4,replica-count=1,executor-image-uri='us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-7:latest',python-module=trainer.task

However, I get the error below.

Error Messages

file:///user_dir/trainer-0.2.tar.gz does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

Note: I notice that file:/// has three slashes, and I believe this has something to do with Docker. [3]

Solution

  • I ended up fixing the problem, so I'll share what happened for anyone who hits a similar error. The problem was that I wasn't using find_packages() correctly.

    First, there are three possible ways of submitting custom vertex-ai jobs.

    1. auto packaging
    2. without auto packaging - Custom container image
    3. without auto packaging - Python App
      1. using local-package-path param
      2. using --python-package-uris flag

    (I believe) methods 1, 2, and 3.1 build a Docker image on the local machine and submit the built image to Vertex AI. Method 3.2 simply uses a pre-built container and combines the Python packages with the image given by executor-image-uri on Vertex AI.

    The problem was that I ran the command below from ../.. (passing ./[PATH]/ as the path to setup.py), so find_packages() did not return the right packages, which caused both methods 3.1 and 3.2 to fail.

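    # Wrong: run from two directories above the project
    # python3 ./[PATH]/vertex-ai-poc/setup.py sdist --formats=gztar
    # Right: run from inside the project directory (vertex-ai-poc)
    python3 ./setup.py sdist --formats=gztar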
    
    from setuptools import find_packages, setup
    
    setup(
        name='trainer',
        version='0.1',
        packages=find_packages(),  # <-- HERE
        include_package_data=True,
    )
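
To illustrate why the working directory matters, here is a minimal sketch. `discover_packages` and `demo` are hypothetical names, and `discover_packages` is only a simplified stand-in for setuptools' `find_packages()`, not its actual implementation. The one property it shares with the real default (`where='.'`) is that the scan starts from the current working directory rather than from the folder holding setup.py. The throwaway layout below is also an assumption and omits the top-level `__init__.py` from the question.

```python
import os
import tempfile


def discover_packages(where="."):
    """Hypothetical, simplified stand-in for setuptools.find_packages().

    Returns the top-level directories under `where` that contain an
    __init__.py. The relative default "." is resolved against the current
    working directory, not against the location of setup.py.
    """
    return sorted(
        entry
        for entry in os.listdir(where)
        if os.path.isfile(os.path.join(where, entry, "__init__.py"))
    )


def demo():
    """Build a throwaway project and scan it from two working directories."""
    original_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as parent:
        project = os.path.join(parent, "vertex-ai-poc")
        os.makedirs(os.path.join(project, "trainer"))
        open(os.path.join(project, "trainer", "__init__.py"), "w").close()
        try:
            os.chdir(parent)   # like invoking setup.py from outside the project
            from_parent = discover_packages()
            os.chdir(project)  # like invoking ./setup.py inside the project
            from_project = discover_packages()
        finally:
            os.chdir(original_cwd)
    return from_parent, from_project
```

Under these assumptions, `demo()` finds no packages when scanning from the parent directory and finds `trainer` when scanning from inside the project, which matches the behaviour described above.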
    

    With the package built correctly, both the local-package-path version and the --python-package-uris version below work.

    3.1 Without auto packaging - Python App - using local-package-path param
    gcloud ai custom-jobs create \
        --region us-central1 \
        --display-name=vertex-ai-poc \
        --project=[PROJECT_ID] \
        --worker-pool-spec=machine-type=e2-standard-4,replica-count=1,executor-image-uri='us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-7:latest',script=task.py,local-package-path=vertex-ai-poc/trainer
    
    3.2 Without auto packaging - Python App - using --python-package-uris flag
    gcloud ai custom-jobs create \
        --region us-central1 \
        --display-name=vertex-ai-poc \
        --project=[PROJECT_ID] \
        --python-package-uris='gs://[PROJECT_ID]/vertex-ai-poc/trainer-0.1.tar.gz' \
        --worker-pool-spec=machine-type=e2-standard-4,replica-count=1,executor-image-uri='us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-7:latest',python-module=trainer.task
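
Before uploading the archive referenced by --python-package-uris, it may be worth checking that it actually contains setup.py and the trainer package. The following is a rough sketch using only the standard library; `missing_from_sdist` is a hypothetical helper, and the default list of required files is an assumption based on the module structure shown in the question.

```python
import tarfile


def missing_from_sdist(
    archive_path,
    required=("setup.py", "trainer/__init__.py", "trainer/task.py"),
):
    """Return the entries in `required` that the sdist archive lacks."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
    # An sdist nests everything under one "<name>-<version>/" directory;
    # drop that prefix so paths can be compared against `required`.
    relative = {name.split("/", 1)[1] for name in names if "/" in name}
    return [path for path in required if path not in relative]
```

For example, `missing_from_sdist("dist/trainer-0.1.tar.gz")` returns an empty list when nothing is missing; a non-empty list suggests the sdist was built from the wrong directory.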
    