I have multiple Python pipelines that I want to run on Google Cloud Platform as on-demand jobs. To simplify the process of installing dependencies and downloading/preparing ML models, I created a Docker image containing all the necessary requirements. The entrypoint of the image is set to my main.py file, which takes unique arguments for each run. I want to be able to run the image on Google Cloud Platform the same way I run it locally:
docker run --rm ${IMAGE} arg1 arg2 arg3
I have successfully uploaded the images, but I am unsure about how to proceed with setting up the next stage of my pipeline. Is it possible to store the image and run it when needed, or do I need to create an API endpoint and host the pipeline as an API?
You can't run images like that dynamically. You have to deploy a service or a job first (Cloud Run, Cloud Run Jobs) and then invoke it. Even if you use Kubernetes (GKE on Google Cloud), you have to deploy the pod before invoking it.
So, deploy your containers on the service that you wish and invoke them with the args that you need for your pipeline. Cloud Run Jobs with parameter override is the best place to achieve this IMO.
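If you want to trigger the job from Python rather than typing the command by hand, here is a minimal sketch that builds a `gcloud run jobs execute` call with an `--args` override. The job name `my-pipeline` and the region `europe-west1` are placeholders I made up for illustration, and it assumes the job was already deployed from your image and that gcloud is installed and authenticated.

```python
import subprocess


def build_execute_command(job_name, region, args):
    """Build the gcloud command that executes an existing Cloud Run job.

    job_name and region are hypothetical placeholders; args are the values
    your main.py would normally receive after the image name in `docker run`.
    """
    cmd = ["gcloud", "run", "jobs", "execute", job_name, f"--region={region}"]
    if args:
        # --args takes a comma-separated list and overrides the container
        # args for this execution only; the deployed job is left unchanged.
        cmd.append("--args=" + ",".join(args))
    return cmd


def execute_job(job_name, region, args):
    """Run the command; raises CalledProcessError if gcloud exits non-zero."""
    subprocess.run(build_execute_command(job_name, region, args), check=True)


# Example (not executed here):
# execute_job("my-pipeline", "europe-west1", ["arg1", "arg2", "arg3"])
```

Note that an argument containing a comma would need gcloud's alternate delimiter syntax (see `gcloud topic escaping`); this sketch does not handle that case.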
As an orchestrator, you can use Cloud Workflows (my fav) or Cloud Composer (managed Airflow).
Edit 1
Indeed, at the beginning, Cloud Run Jobs had its args and env vars static and bound to the deployment revision.
Many alpha testers, me included, asked for the ability to pass parameters to the job. The solution is the parameter override.
In my case, it was for DBT: I deploy a single container, but I want to change the args sent to DBT to run either a specific model or all of them. The parameter override lets me do exactly that.
About the API: of course, all the Google Cloud services are usable through their APIs for automation. That's why I recommended Cloud Workflows or Composer to achieve that.
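To give an idea of what an orchestrator sends under the hood, here is a rough sketch of the Cloud Run Admin API v2 `jobs.run` call with an args override, using only the Python standard library. The project, region, and job names are hypothetical, and obtaining the OAuth access token is left out (for a quick manual test, `gcloud auth print-access-token` can supply one).

```python
import json
import urllib.request


def build_run_request(project, region, job, args, access_token):
    """Build (but do not send) the HTTP request that starts one job execution.

    project, region and job are placeholders for your own values.
    """
    url = (
        "https://run.googleapis.com/v2/"
        f"projects/{project}/locations/{region}/jobs/{job}:run"
    )
    # containerOverrides replaces the container args for this execution only.
    body = {"overrides": {"containerOverrides": [{"args": list(args)}]}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example (not executed here):
# req = build_run_request("my-project", "europe-west1", "my-pipeline",
#                         ["arg1", "arg2", "arg3"], token)
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

In practice I would let Cloud Workflows or Composer make this call, since they handle authentication and retries for you; the sketch is only there to show the shape of the request.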