
GCP AI Platform: Error when creating a custom predictor model version (trained PyTorch model + torchvision.transform)


I am currently trying to deploy a custom model to AI Platform by following https://cloud.google.com/ai-platform/prediction/docs/deploying-models#gcloud_1. The model is a combination of a pre-trained PyTorch model and torchvision.transform. I keep getting the error below, which appears to be related to the 500 MB constraint on custom prediction routines.

ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.

Setup.py

from setuptools import setup
from pathlib import Path

base = Path(__file__).parent
REQUIRED_PACKAGES = [line.strip() for line in open(base/"requirements.txt")]
print(f"\nPackages: {REQUIRED_PACKAGES}\n\n")

# [torch==1.3.0,torchvision==0.4.1, ImageHash==4.2.0
# Pillow==6.2.1,pyvis==0.1.8.2] installs 800mb worth of files

setup(description="Extract features of a image",
      author='',
      name='test',
      version='0.1',
      install_requires=REQUIRED_PACKAGES,
      project_urls={
                    'Documentation':'https://cloud.google.com/ai-platform/prediction/docs/custom-prediction-routines#tensorflow',
                    'Deploy':'https://cloud.google.com/ai-platform/prediction/docs/deploying-models#gcloud_1',
                    'Ai_platform troubleshooting':'https://cloud.google.com/ai-platform/training/docs/troubleshooting',
                    'Say Thanks!': 'https://medium.com/searce/deploy-your-own-custom-model-on-gcps-ai-platform-7e42a5721b43',
                    'google Torch wheels':"http://storage.googleapis.com/cloud-ai-pytorch/readme.txt",
                    'Torch & torchvision wheels':"https://download.pytorch.org/whl/torch_stable.html"
                    },
    python_requires='~=3.7',
    scripts=['predictor.py', 'preproc.py'])
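The requirements.txt that setup.py reads is not shown; based on the comment in setup.py above, it presumably contains pins along these lines (the exact file contents are an assumption):

```
torch==1.3.0
torchvision==0.4.1
ImageHash==4.2.0
Pillow==6.2.1
pyvis==0.1.8.2
```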

Steps taken: I tried adding 'torch' and 'torchvision' directly to the REQUIRED_PACKAGES list in setup.py, so that PyTorch + torchvision would be installed as dependencies at deployment time. My guess is that AI Platform internally downloads the PyPI package for PyTorch, which is 500+ MB, and this causes the model deployment to fail. If I deploy the model with 'torch' only, it seems to work (though it of course throws an error about not being able to find the 'torchvision' library).

File size

  • torch (torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl, about 111 MB)
  • torchvision (torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl, about 46 MB), both downloaded from https://download.pytorch.org/whl/torch_stable.html and stored on GCS
  • The zipped predictor code package (.tar.gz format) produced by setup.py (5 KB)
  • A trained PyTorch model (44 MB)

In total, the model dependencies should be less than 250 MB, but I still keep getting this error. I have also tried using the torch and torchvision wheels from Google's mirrored packages (http://storage.googleapis.com/cloud-ai-pytorch/readme.txt), but the same memory issue persists. AI Platform is quite new to us and we would appreciate input from anyone more experienced.
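As a sanity check on that claim, the approximate artifact sizes listed above can be totted up; on paper they sit well under the 500 MB limit:

```python
# Approximate sizes (MB) of the deployment artifacts listed above.
sizes_mb = {
    "torch-1.3.1+cpu wheel": 111,
    "torchvision-0.4.1+cpu wheel": 46,
    "custom code .tar.gz": 0.005,
    "trained PyTorch model": 44,
}

total = sum(sizes_mb.values())
print(f"total = {total:.1f} MB")  # about 201 MB, well under 500 MB
assert total < 500
```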

MORE INFO:

GCP CLI Input:

My environment variable:

BUCKET_NAME= “something”
MODEL_DIR=<>
VERSION_NAME='v6'
MODEL_NAME="something_model"
STAGING_BUCKET=$MODEL_DIR<>
# TORCH_PACKAGE=$MODEL_DIR"package/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl"
# TORCHVISION_PACKAGE=$MODEL_DIR"package/torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl"
TORCH_PACKAGE=<>
TORCHVISION_PACKAGE=<>
CUSTOM_CODE_PATH=$STAGING_BUCKET"imt_ai_predict-0.1.tar.gz"
PREDICTOR_CLASS="predictor.MyPredictor"
REGION=<>
MACHINE_TYPE='mls1-c4-m2'
 
gcloud beta ai-platform versions create $VERSION_NAME \
  --model=$MODEL_NAME \
  --origin=$MODEL_DIR \
  --runtime-version=2.3 \
  --python-version=3.7 \
  --machine-type=$MACHINE_TYPE \
  --package-uris=$CUSTOM_CODE_PATH,$TORCH_PACKAGE,$TORCHVISION_PACKAGE \
  --prediction-class=$PREDICTOR_CLASS
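For reference, the class named by --prediction-class has to implement the Predictor interface that AI Platform custom prediction routines expect (an instance method predict and a classmethod from_path). A minimal sketch, with the actual model loading stubbed out since my model code isn't shown here:

```python
class MyPredictor:
    """Skeleton of an AI Platform custom prediction routine."""

    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # Called once per request; must return a JSON-serializable list
        # with one prediction per input instance.
        return [self._model(instance) for instance in instances]

    @classmethod
    def from_path(cls, model_dir):
        # AI Platform calls this once at startup with the path to the
        # unpacked model directory. A real implementation would
        # torch.load() the trained model here; a trivial identity
        # function stands in for it in this sketch.
        model = lambda x: x  # placeholder for the loaded model
        return cls(model)
```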

GCP CLI Output:

 [1] global
 [2] asia-east1
 [3] asia-northeast1
 [4] asia-southeast1
 [5] australia-southeast1
 [6] europe-west1
 [7] europe-west2
 [8] europe-west3
 [9] europe-west4
 [10] northamerica-northeast1
 [11] us-central1
 [12] us-east1
 [13] us-east4
 [14] us-west1
 [15] cancel
Please enter your numeric choice:  1
 
To make this the default region, run `gcloud config set ai_platform/region global`.
 
Using endpoint [https://ml.googleapis.com/]
Creating version (this might take a few minutes)......failed.                                                                                                                                            
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.

My finding: I have found articles by people struggling in the same way with the PyTorch package, who made it work by hosting the torch wheels on GCS (https://medium.com/searce/deploy-your-own-custom-model-on-gcps-ai-platform-7e42a5721b43). I have tried the same approach with torch and torchvision, but no luck so far, and I am waiting for a response from cloudml-feedback@google.com. Any help getting a custom torch + torchvision based predictor working on AI Platform would be great.


Solution

  • Got this fixed by a combination of a few things. I stuck to the quad-core MLS1 machine (mls1-c4-m2) and a custom predictor routine (<500 MB).

    • Install the libraries via the setup.py install_requires parameter, but instead of passing just the package name and its version, point torch at the correct wheel URL (ideally <100 MB):
    REQUIRED_PACKAGES = [line.strip() for line in open(base/"requirements.txt")] +\
    ['torchvision==0.5.0', 'torch @ https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl']
    
    • I reduced the number of preprocessing steps taken. I couldn't fit them all in, so JSON-serialize the data exchanged between preproc.py and predictor.py in both directions:
    import json
    payload = json.dumps(data_to_send_to_predictor)
    
    • Import only the functions you need from a library, rather than the whole package:
    from torch import zeros, load
    # ... your code ...
    
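A minimal sketch of that JSON handoff, assuming the preprocessed data can be reduced to plain Python lists (converting tensors via .tolist() is an assumption about the data, not something shown above):

```python
import json

# preproc.py side: reduce preprocessed output to JSON-friendly types,
# e.g. features = tensor.tolist() for a torch tensor.
features = [[0.1, 0.2], [0.3, 0.4]]
payload = json.dumps({"features": features})

# predictor.py side: decode before feeding the model.
data = json.loads(payload)
assert data["features"] == features
```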

    [Important]

    • I haven't tested different ways of serializing the trained model; there could be a memory difference depending on which one you use (torch.save, pickle, joblib, etc.).

    • Found this link suggesting that those whose organization is a GCP partner may be able to request more quota (I'm guessing from 500 MB to 2 GB or so). I didn't have to go in this direction, as my issue was resolved (and other ones popped up): https://cloud.google.com/ai-platform/training/docs/quotas
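On the serialization point above: a rough way to compare serializers is to dump the same object with each and compare byte counts. Here is a sketch using pickle at two protocol versions on a dummy state dict; a real comparison would use the actual trained model and also try torch.save and joblib:

```python
import pickle

# A dummy stand-in for a model state dict.
state = {"layer1.weight": [0.0] * 1000, "layer1.bias": [0.0] * 10}

for protocol in (2, pickle.HIGHEST_PROTOCOL):
    blob = pickle.dumps(state, protocol=protocol)
    print(f"protocol {protocol}: {len(blob)} bytes")
```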