We are currently using RabbitMQ with Celery on some VMs for this:
- We have a batch of tasks we want to process in parallel (e.g. process some files concurrently or run some machine learning inference on images)
- When the batch is done, we call back our App, gather the results, and start another batch of tasks which may depend on the results of the previously executed tasks
So we have the requirements:
- Our App needs to know when a batch is done
- Our App needs the gathered task results across the batch
- Calling back the App from every single task that succeeds might overload and kill the App, so we only want one callback per batch
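For context, this is roughly how our current setup looks with Celery (the task and helper names are illustrative, not our actual code): a chord runs the batch in parallel and fires a single callback with the gathered results once every task has finished.

```python
from celery import Celery, chord

# a result backend is required for chords; broker/backend URLs are placeholders
app = Celery("batch", broker="amqp://guest@localhost//", backend="rpc://")

@app.task
def process_file(path):
    # per-file work (hypothetical); returns a small result dict
    return {"path": path, "status": "ok"}

@app.task
def on_batch_done(results):
    # runs exactly once, with the gathered results of the whole batch,
    # then kicks off the next dependent batch (hypothetical helper)
    start_next_batch(results)

def start_next_batch(results):
    pass  # e.g. call back the App with the results and enqueue the next batch

# run a batch in parallel and get a single callback when all tasks finish
batch = chord(process_file.s(p) for p in ["a.jpg", "b.jpg"])
batch(on_batch_done.s())
```

This is the behaviour we want to reproduce on Google Cloud without managing the VMs ourselves.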
Now we are trying to use Google Cloud for this and we would like to move away from VMs to something like Google Cloud Tasks or Pub/Sub in combination with Google Cloud Functions. Is there any best-practice setup for our problem on Google Cloud?
Google Cloud currently offers only one workflow manager, Cloud Composer (based on the Apache Airflow project); I'm not counting the AI Platform workflow manager (AI Pipelines). This managed solution lets you do the same things you do today with Celery:
- An event occurs
- A Cloud Function is called to process the event
- The Cloud Function triggers a DAG (Directed Acyclic Graph, a workflow in Airflow)
- A step in the DAG runs many subprocesses (Cloud Functions, Cloud Run, anything else), waits for them to finish, and continues to the next step... (see the sketch below)
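A minimal sketch of such a fan-out/fan-in DAG, assuming Airflow 2.x import paths and hypothetical callables (in your case each callable would invoke a Cloud Function or Cloud Run service):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

# hypothetical callables; each fan-out task would call a Cloud Function / Cloud Run service
def run_subtask(index, **kwargs):
    print(f"processing chunk {index}")

def gather_results(**kwargs):
    print("all chunks done, starting the next step")

with DAG("batch_workflow", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    fan_out = [
        PythonOperator(
            task_id=f"process_{i}",
            python_callable=run_subtask,
            op_kwargs={"index": i},
        )
        for i in range(10)
    ]
    gather = PythonOperator(task_id="gather_results", python_callable=gather_results)

    # the fan-out tasks run in parallel; 'gather' only runs once they all succeed
    fan_out >> gather
```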
Two warnings:
- Composer is expensive (about $400 per month for the minimal config)
- DAGs are acyclic: no loops are allowed
Note: a new workflow product should come to GCP. There is no ETA for now, and at the beginning parallelism won't be managed. IMO this solution is the right one for you, but not in the short term, maybe in 12 months.
About the message queue itself, you can use Pub/Sub, which is very efficient and affordable.
Alternative
You can build your own system following this process:
- An event occurs
- A Cloud Function is called to process the event
- The Cloud Function creates as many Pub/Sub messages as there are batched tasks
- For each message generated, you write an entry into Firestore with the initial event and the messageId
- The generated messages are consumed (by a Cloud Function, Cloud Run, or anything else) and, at the end of the processing, the Firestore entry is updated to say that the subprocess has been completed
- You plug a Cloud Function onto the Firestore onWrite event. The function checks whether all the subprocesses for an initial event are completed. If so, it goes to the next step... (a sketch follows this list)
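A rough sketch of the two Cloud Functions involved, assuming a Firestore layout of `batches/{batch_id}/tasks/{message_id}` and hypothetical project, topic, and helper names (adapt the payload shape and the "next step" logic to your case):

```python
import json
from google.cloud import firestore, pubsub_v1

# hypothetical project and topic names
PROJECT = "my-project"
TOPIC = "batch-tasks"

db = firestore.Client()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)

def fan_out(event, context):
    """Triggered by the initial event: publish one Pub/Sub message per batched
    task and record each one in Firestore under the batch (assumed payload shape)."""
    batch_id = event["batch_id"]
    for task in event["tasks"]:
        future = publisher.publish(topic_path, json.dumps(task).encode("utf-8"))
        message_id = future.result()
        db.collection("batches").document(batch_id) \
          .collection("tasks").document(message_id) \
          .set({"status": "pending", "task": task})

def on_task_write(data, context):
    """Firestore onWrite-triggered function: whenever a task document changes,
    check whether every task of its batch is completed."""
    # context.resource ends with .../documents/batches/<batch_id>/tasks/<message_id>
    doc_path = context.resource.split("/documents/")[1]
    batch_id = doc_path.split("/")[1]

    tasks = db.collection("batches").document(batch_id).collection("tasks").stream()
    statuses = [t.to_dict().get("status") for t in tasks]

    if statuses and all(s == "completed" for s in statuses):
        # all subprocesses are done: go to the next step (hypothetical helper)
        start_next_step(batch_id)

def start_next_step(batch_id):
    pass  # e.g. publish to another topic or call the App's callback endpoint
```

The workers that consume the Pub/Sub messages simply set their task document's `status` to `completed` when they finish, which is what triggers the completion check above.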
We have implemented a similar workflow in my company, but it's not easy to maintain and debug when a problem occurs. Otherwise, it works great.