I have successfully run my model training on GCP in Vertex AI, but when I try to run batch predictions, the job hangs. When I run the model in my local environment it finishes in seconds; on GCP the training itself takes about 8 minutes.
My model code is here:
from google.cloud import storage
import os
import gcsfs  # needed so pandas can read gs:// paths
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split

print("Mod script starts")

# Read the training data straight from Cloud Storage
ds = pd.read_csv("gs://shottypeids/ShotTypeModel_alldata.csv")
print("Data read in success")

# Labels and features
y = ds[["Label_Num", "ShotPlus"]]
X = ds.drop(["ShotPlus", "Label_Num"], axis=1)

X_train, X_test, y_train1, y_test1 = train_test_split(
    X, y, test_size=0.3, random_state=785
)
y_test = y_test1[["Label_Num"]]
y_train = y_train1[["Label_Num"]]

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    'max_depth': 6,
    'min_child_weight': 4,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'multi:softprob',
    'num_class': 7,
    'seed': 123,
}
num_boost_round = 999
print("Mod Prep Success")

mod_addK = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10,
)
print("Mod Run")

# Save the model artifact to the local filesystem (doesn't persist)
artifact_filename = 'ShotTypeModel_2pt1.pkl'
local_path = artifact_filename
with open(local_path, 'wb') as model_file:
    pickle.dump(mod_addK, model_file)

# Upload the model artifact to the Cloud Storage directory Vertex AI provides
model_directory = os.environ['AIP_MODEL_DIR']
storage_path = os.path.join(model_directory, artifact_filename)
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)
print("Model artefacts saved")
Looking at the logs, there are a couple of errors about pip, but the training job runs and completes.
I then have the model in the Models tab on GCP, and the artefacts are saved in Cloud Storage. I set up a batch prediction job on a CSV file and it just hangs for ages. I thought perhaps it struggled because I had not immediately put it in a container, so I re-ran it and loaded it into the same container I used for training (XGBoost 1.1).
It has now been running for over 45 minutes, and the prior attempts ran for over half an hour. I cancelled the last jobs, and the failure message says it is due to a model server start-up timeout and that I should check the container spec. I have not found any information on what I should do there.
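To clarify what I mean by the container spec: as far as I understand it, the model needs to be registered against one of the prebuilt *prediction* images rather than the training image. A sketch with the Python SDK, where the project, display name, and bucket path are placeholders:

# Sketch: register the artefact folder against a prebuilt XGBoost 1.1
# prediction image (not the training image). Names and paths are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="shot-type-model",
    artifact_uri="gs://shottypeids/model_output",  # folder holding the pickle
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-1:latest"
    ),
)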
I've followed the instructions here to the letter, but it just hangs. I cannot get the API working either, but I only ran this in Cloud Shell rather than a VM, so I will be trying that next.
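For completeness, the call I'm attempting via the API looks roughly like this with the Vertex AI Python SDK; the project, model ID, and bucket paths are placeholders:

# Sketch of the batch prediction job via the Vertex AI Python SDK.
# "my-project", the model ID, and the gs:// paths are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model("1234567890")  # numeric ID from the Models tab

batch_job = model.batch_predict(
    job_display_name="shottype-batch",
    gcs_source="gs://shottypeids/batch_input.csv",
    gcs_destination_prefix="gs://shottypeids/batch_output",
    instances_format="csv",
    machine_type="n1-standard-4",
    sync=True,  # block until the job succeeds or fails
)
print(batch_job.state)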
Any tips welcome, J
So the simple answer to this appears to be that the file literally has to be saved as "model.pkl": the prebuilt prediction containers look for that exact filename. I assumed that the name before the extension could vary, but no.
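So the only change needed in the training script above is the artifact name:

# Fix: the prebuilt prediction container expects this exact filename.
artifact_filename = 'model.pkl'  # was 'ShotTypeModel_2pt1.pkl'
local_path = artifact_filename
with open(local_path, 'wb') as model_file:
    pickle.dump(mod_addK, model_file)

model_directory = os.environ['AIP_MODEL_DIR']
storage_path = os.path.join(model_directory, artifact_filename)
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)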
I am still struggling to get a prediction generated, but the job now returns the failure within 15 minutes or so.