We are trying to run a batch prediction for a custom model. The model was trained following this tutorial: https://codelabs.developers.google.com/codelabs/vertex-ai-custom-code-training#4
The code that submits the job in a pipeline:
model = aiplatform.Model(model_path)
batch_prediction_job = model.batch_predict(
    gcs_source=gcs_source,
    gcs_destination_prefix=gcs_destination,
    machine_type='n1-standard-4',
    instances_format='csv',
    sync=False
)
Running the batch prediction job fails with the following error in the pipeline:
[KFP Executor 2023-01-18 14:08:09,862 INFO]: BatchPredictionJob projects/472254905662/locations/us-central1/batchPredictionJobs/3522181183414730752 current state:
JobState.JOB_STATE_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/kfp/v2/components/executor_main.py", line 104, in <module>
    executor_main()
  File "/usr/local/lib/python3.7/site-packages/kfp/v2/components/executor_main.py", line 100, in executor_main
    executor.execute()
  File "/usr/local/lib/python3.7/site-packages/kfp/v2/components/executor.py", line 309, in execute
    result = self._func(**func_kwargs)
  File "/tmp/tmp.ZqplJAZqqL/ephemeral_component.py", line 23, in create_batch_inference_component
    print(f'Batch prediction job "{batch_prediction_job.resource_name}" submitted')
  File "/usr/local/lib/python3.7/site-packages/google/cloud/aiplatform/base.py", line 676, in resource_name
    self._assert_gca_resource_is_available()
  File "/usr/local/lib/python3.7/site-packages/google/cloud/aiplatform/base.py", line 1324, in _assert_gca_resource_is_available
    else ""
RuntimeError: BatchPredictionJob resource has not been created.
The failed batch prediction job itself reports an error, but it is not clear what it means:
Batch prediction job BatchPredictionJob 2023-01-18 14:21:50.490123 encountered the following errors:
Model server terminated: model server container terminated: exit_code: 1 reason: "Error" started_at { seconds: 1674052639 } finished_at { seconds: 1674052640 }
Batch prediction for an AutoML model trained on the same Titanic dataset works.
We have found no way to troubleshoot this further. We have tried different values for instances_format, not specifying machine_type, and improving the dataset for predictions (the guidelines say all string fields should be enclosed in double quotes), but none of this helped.
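As an illustration of that quoting guideline, the input CSV can be written with every string field double-quoted using Python's csv module and QUOTE_NONNUMERIC. This is a generic sketch; the column names below are hypothetical Titanic-style fields, not the actual dataset schema:

import csv
import io

# Illustrative prediction rows; column names are examples only.
rows = [
    {"sex": "female", "age": 38.0, "fare": 71.28, "embarked": "C"},
    {"sex": "male", "age": 22.0, "fare": 7.25, "embarked": "S"},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["sex", "age", "fare", "embarked"],
    quoting=csv.QUOTE_NONNUMERIC,  # double-quotes every non-numeric field
)
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue())

Numeric fields stay unquoted while all strings (including the header row) are wrapped in double quotes, which matches the stated guideline.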
There were several issues here that our team had to work through. The decisive fix for the
RuntimeError: BatchPredictionJob resource has not been created.
error was removing sync=False from the model.batch_predict call. This took our team (3 people) about three weeks to figure out. Now the pipeline is green and the batch predictions are working.
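A plausible explanation for why sync=False triggered the RuntimeError: with sync=False, batch_predict returns immediately, before the BatchPredictionJob resource exists server-side, so reading resource_name on the next line races against the creation. The toy class below mimics that behavior; it is a simplified illustration, not the actual google-cloud-aiplatform internals:

import threading
import time


class FakeBatchPredictionJob:
    """Toy stand-in for an asynchronously created resource (not the real SDK class)."""

    def __init__(self):
        self._resource = None

    def create_async(self):
        # Simulate server-side resource creation finishing a bit later.
        def _create():
            time.sleep(0.2)
            self._resource = "projects/p/locations/l/batchPredictionJobs/1"

        t = threading.Thread(target=_create)
        t.start()
        return t

    @property
    def resource_name(self):
        if self._resource is None:
            # Mirrors the SDK's error when the resource is not ready yet.
            raise RuntimeError("BatchPredictionJob resource has not been created.")
        return self._resource


job = FakeBatchPredictionJob()
t = job.create_async()

# Reading the name immediately (as with sync=False) fails:
try:
    job.resource_name
    immediate_access_failed = False
except RuntimeError:
    immediate_access_failed = True

# Waiting for creation (the effect of the default sync=True) succeeds:
t.join()
print(immediate_access_failed, job.resource_name)

In the real pipeline, leaving sync at its default (or otherwise waiting for the job before reading resource_name) should avoid this race.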