Tags: google-cloud-platform, gcp-ai-platform-training, google-ai-platform

Cheaper alternative to orchestrate workflow


I am currently putting an ML model in production and was researching which tool I should rely on to orchestrate the entire process:

  1. Grab the data from BQ.
  2. Do some feature engineering and general data pre-processing.
  3. Create train/test splits.
  4. Dump the data in .csv format (in a GCS bucket).
  5. Start the training job and save the model artifact.
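
The five steps above can be sketched as one script. This is a minimal, hedged sketch: in production the first and fourth steps would use the `google-cloud-bigquery` and `google-cloud-storage` clients (stubbed out here so the split/dump logic is runnable), and the toy rows stand in for the query result.

```python
import csv
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows deterministically and split into train/test lists (step 3)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def dump_csv(path, header, rows):
    """Write rows to a local CSV (step 4; upload to gs://... afterwards)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Step 1 in production: rows = bigquery.Client().query("SELECT ...").result()
rows = [(i, i * 2.0) for i in range(100)]  # stand-in for the BQ result
train, test = train_test_split(rows)
dump_csv("train.csv", ["id", "feature"], train)
dump_csv("test.csv", ["id", "feature"], test)
# Step 5 would then submit the AI Platform training job pointing at the CSVs.
```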

In a separate stage, the model will be used to:

  1. Retrieve batch predictions from inputs taken from a BQ table.
  2. Insert predictions computed above in a different BQ table.
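
The batch-prediction stage could look roughly like this. Everything here is a hypothetical stand-in: `predict` fakes the real model (which would be loaded from the GCS artifact), and the BigQuery read/insert calls are only indicated in comments.

```python
def predict(features):
    # Placeholder for model.predict(); the real artifact is loaded from GCS.
    return [sum(f) for f in features]

def score_batch(rows):
    """rows: (row_id, feature, ...) tuples pulled from the input BQ table."""
    ids = [r[0] for r in rows]
    preds = predict([r[1:] for r in rows])
    # In production, insert into the output table, e.g.:
    # bigquery.Client().insert_rows_json("project.dataset.predictions", records)
    return [{"id": i, "prediction": p} for i, p in zip(ids, preds)]

# Stand-in for: bigquery.Client().query("SELECT id, f1, f2 FROM inputs").result()
out = score_batch([(1, 0.5, 1.5), (2, 2.0, 3.0)])
```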

Now, I completely understand that orchestrating such a workflow would be Cloud Composer's sweet spot, but the project is for a non-profit organization and the €381/month price tag wouldn't be trivial for them.

I am therefore left thinking about the following options:

Package everything into the training task

One option could simply be not to break down the training workflow at all, and instead package everything up into a single AI Platform training task.
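
Concretely, that would mean one `task.py` that runs every stage in sequence. All stage functions below are hypothetical stand-ins; the real ones would hit BigQuery, GCS, and the actual trainer.

```python
def extract():                       # step 1: pull rows from BQ (stubbed)
    return [(x, 2 * x + 1) for x in range(50)]

def preprocess(rows):                # step 2: feature engineering (toy scaling)
    return [(x / 50.0, y) for x, y in rows]

def split(rows, cut=40):             # step 3: train/test split
    return rows[:cut], rows[cut:]

def fit(train):                      # step 5: "training" (mean of targets here)
    return sum(y for _, y in train) / len(train)

def main():
    train, test = split(preprocess(extract()))
    # Step 4 (CSV dump to a GCS bucket) elided in this sketch.
    return fit(train)                # model artifact would be saved to GCS

model = main()
```

The upside is zero orchestration cost; the downside is that a failure in any stage re-runs the whole job, and the stages can't be monitored or retried independently.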

Use Cloud Functions to start/stop Composer

I was thinking of something along the lines of:

  1. Start Composer instance with a Cloud Function
  2. Find a way to start the Airflow workflow from another Cloud Function
  3. Send a pub/sub message once the workflow is over
  4. Consume the aforementioned pub/sub message to fire another Cloud Function that will stop the Composer instance.
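
Step 4 above could be a small pub/sub-triggered Cloud Function. This is a hedged sketch: the event decoding follows the pub/sub message shape (base64-encoded `data`), but the actual Composer environment deletion is stubbed out, and the `status` field is an assumption about what the workflow would publish.

```python
import base64
import json

def on_workflow_done(event, context=None):
    """Pub/sub-triggered handler: tear down Composer when the DAG reports success."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if payload.get("status") == "success":
        # Stub for the real teardown, e.g. an authenticated DELETE against the
        # Composer environments REST endpoint.
        return "stopping composer"
    return "leaving composer running"

# Simulate the pub/sub message the workflow would publish on completion.
msg = {"data": base64.b64encode(json.dumps({"status": "success"}).encode())}
action = on_workflow_done(msg)
```

Note that Composer environments take tens of minutes to spin up, so the start/stop cycle itself adds latency and isn't free.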

Self host Airflow in a small VM

This would obviously entail a bit more research to get it working, especially because I'd have no idea how to implement OAuth in Nginx.

Dockerize everything and use CloudRun for training

This would probably look like:

  1. Package the training job in a Docker container (with different entrypoints for the train/serve tasks).
  2. Fire Cloud Run using Cloud Scheduler (or a Cloud Function; I still have to wrap my head around this) and perform all the needed tasks.
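
The shared-container entrypoint from step 1 could be a single CLI script whose subcommand the Dockerfile `CMD` (or the Cloud Run invocation) selects. `train()` and `serve()` here are placeholder stubs for the real tasks.

```python
import argparse

def train():
    # Placeholder for the full training pipeline (BQ -> features -> fit -> GCS).
    return "training started"

def serve():
    # Placeholder for the batch-prediction task (BQ in -> predict -> BQ out).
    return "serving batch predictions"

def main(argv=None):
    parser = argparse.ArgumentParser(description="Container entrypoint")
    parser.add_argument("task", choices=["train", "serve"])
    args = parser.parse_args(argv)
    return train() if args.task == "train" else serve()

result = main(["train"])  # e.g. CMD ["python", "entrypoint.py", "train"]
```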

AFAIK Cloud Run still has a 15-minute hard cap on request runtime, so this option might not be viable after all...


How should I tackle this? I am not sure if I've overlooked anything simpler than the options listed above.


Solution

  • There's a recent product, https://cloud.google.com/workflows, which you can use to orchestrate this: for example, call the BQ API to create intermediate tables (with the feature engineering and transformations), then export the data, and finally trigger the model training. The orchestration itself would probably be free, since there's a free tier at the moment; you would pay only for the BQ queries, storage, and training.
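
A rough sketch of what such a Workflows definition could look like. Step names, SQL, and request bodies are placeholders, not a verified configuration; the general shape (named steps, `call: http.post`, OAuth2 auth, chained `result` values) follows the Workflows syntax.

```yaml
main:
  steps:
    - buildFeatureTable:          # feature engineering as a BQ query job
        call: http.post
        args:
          url: https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/jobs
          auth:
            type: OAuth2
          body:
            configuration:
              query:
                query: "CREATE OR REPLACE TABLE dataset.features AS SELECT ..."
        result: featureJob
    - exportToGcs:                # dump the prepared data as CSV in a bucket
        call: http.post
        args:
          url: https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/jobs
          auth:
            type: OAuth2
          body:
            configuration:
              extract:
                destinationUris: ["gs://my-bucket/data/*.csv"]
        result: exportJob
    - startTraining:              # kick off the AI Platform training job
        call: http.post
        args:
          url: https://ml.googleapis.com/v1/projects/PROJECT_ID/jobs
          auth:
            type: OAuth2
          body:
            jobId: training_run
            trainingInput: {}     # scaleTier, packageUris, etc. go here
        result: trainingJob
```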