google-cloud-vertex-ai

Vertex AI: Batch prediction for custom model fails with RuntimeError: BatchPredictionJob resource has not been created


We are trying to run a batch prediction for a custom model.

The training was done after this tutorial: https://codelabs.developers.google.com/codelabs/vertex-ai-custom-code-training#4

The code to submit the job in a pipeline:

from google.cloud import aiplatform

model = aiplatform.Model(model_path)
batch_prediction_job = model.batch_predict(
    gcs_source=gcs_source,
    gcs_destination_prefix=gcs_destination,
    machine_type='n1-standard-4',
    instances_format='csv',
    sync=False
)

Running the batch prediction job fails with the following error in the pipeline:

JobState.JOB_STATE_FAILED
[KFP Executor 2023-01-18 14:08:09,862 INFO]: BatchPredictionJob projects/472254905662/locations/us-central1/batchPredictionJobs/3522181183414730752 current state:
JobState.JOB_STATE_FAILED
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/kfp/v2/components/executor_main.py", line 104, in <module>
executor_main()
File "/usr/local/lib/python3.7/site-packages/kfp/v2/components/executor_main.py", line 100, in executor_main
executor.execute()
File "/usr/local/lib/python3.7/site-packages/kfp/v2/components/executor.py", line 309, in execute
result = self._func(**func_kwargs)
File "/tmp/tmp.ZqplJAZqqL/ephemeral_component.py", line 23, in create_batch_inference_component
print(f'Batch prediction job "{batch_prediction_job.resource_name}" submitted')
File "/usr/local/lib/python3.7/site-packages/google/cloud/aiplatform/base.py", line 676, in resource_name
self._assert_gca_resource_is_available()
File "/usr/local/lib/python3.7/site-packages/google/cloud/aiplatform/base.py", line 1324, in _assert_gca_resource_is_available
else ""
RuntimeError: BatchPredictionJob resource has not been created.

The failed batch prediction job reports an error, but it is hard to understand what it means:

Batch prediction job BatchPredictionJob 2023-01-18 14:21:50.490123 encountered the following errors:

Model server terminated: model server container terminated: exit_code: 1 reason: "Error" started_at { seconds: 1674052639 } finished_at { seconds: 1674052640 }

Batch prediction for an AutoML model trained for the same Titanic dataset works.

There seems to be no way to troubleshoot this. We have tried a different instances_format, not specifying machine_type, and improving the dataset for predictions (the guidelines say all string fields should be enclosed in double quotes), but none of this has helped.


Solution

  • There were three issues, which we managed to solve as a team:

    1. We used different containers for training and serving the model. The serving container pinned a specific scikit-learn version, but we did not control the scikit-learn version in the training container. The fix was to install the required scikit-learn version in the training container as well.
    2. We didn't know the correct format of the input for batch predictions. While the documentation has samples for online inference using endpoints, there are no sample input files for batch prediction. The format is described in this answer: https://stackoverflow.com/a/68123138/2082681. You just need to pass the examples as one JSON array per line and (very important) give the source file a .jsonl extension (see the first sketch after this list).
    3. And finally (!), even after the batch predictions started to work and produced a file with correct predictions, the pipeline that submitted the batch prediction job still failed (!) with the same error: BatchPredictionJob resource has not been created. This was fixed by removing sync=False from the model.batch_predict call (see the second sketch after this list).

    This took our team (3 people) about three weeks to figure out. Now, the pipeline is green and the batch predictions are working.
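
To make the input format concrete, here is a minimal sketch of how such a .jsonl source file could be produced. The feature values and their order are made up for illustration; they have to match what the trained Titanic model actually expects.

import json

# Hypothetical Titanic-style instances; the order and types of the feature
# values must match what the trained scikit-learn model expects.
instances = [
    [3, "male", 22.0, 1, 0, 7.25],
    [1, "female", 38.0, 1, 0, 71.28],
]

# One JSON array per line; the source file must have the .jsonl extension.
with open("batch_input.jsonl", "w") as f:
    for instance in instances:
        f.write(json.dumps(instance) + "\n")

The gcs_source passed to batch_predict then points at this file uploaded to GCS, with instances_format='jsonl' instead of 'csv'.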
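
And here is a sketch of the corrected submission call, assuming model_path, gcs_source and gcs_destination are defined as in the question; job_display_name is a hypothetical name added for readability.

from google.cloud import aiplatform

model = aiplatform.Model(model_path)

# Without sync=False the call blocks until the job has been created and
# has finished, so resource_name is available when it returns.
batch_prediction_job = model.batch_predict(
    job_display_name='titanic-batch-prediction',  # hypothetical name
    gcs_source=gcs_source,                         # the .jsonl file in GCS
    gcs_destination_prefix=gcs_destination,
    machine_type='n1-standard-4',
    instances_format='jsonl',
)
print(f'Batch prediction job "{batch_prediction_job.resource_name}" finished')

If asynchronous submission is still needed, calling batch_prediction_job.wait() before reading resource_name should also work, but removing sync=False is what we used.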