We are currently using RabbitMQ with Celery on some VMs for this:
- We have a batch of tasks we want to process in parallel (e.g. process some files concurrently or run some machine learning inference on images)
- When the batch is done, we call back our App, gather the results, and start another batch of tasks which may depend on the results of the previously executed tasks
So we have the requirements:
- Our App needs to know when a batch is done
- Our App needs the gathered task results across the batch
- Calling back the App from every single task that succeeds might overload and kill the App, so we only want one callback per batch
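For context, this is roughly how our current setup looks with Celery (the task and helper names are illustrative, not our actual code): a chord runs the batch in parallel and fires a single callback with the gathered results once every task has finished.

```python
from celery import Celery, chord

# a result backend is required for chords; broker/backend URLs are placeholders
app = Celery("batch", broker="amqp://guest@localhost//", backend="rpc://")

@app.task
def process_file(path):
    # per-file work (hypothetical); returns a small result dict
    return {"path": path, "status": "ok"}

@app.task
def on_batch_done(results):
    # runs exactly once, with the gathered results of the whole batch,
    # then kicks off the next dependent batch (hypothetical helper)
    start_next_batch(results)

def start_next_batch(results):
    pass  # e.g. call back the App with the results and enqueue the next batch

# run a batch in parallel and get a single callback when all tasks finish
batch = chord(process_file.s(p) for p in ["a.jpg", "b.jpg"])
batch(on_batch_done.s())
```

This is the behaviour we want to reproduce on Google Cloud without managing the VMs ourselves.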
Now we are trying to use Google Cloud for this and we would like to move away from VMs to something like Google Cloud Tasks or Pub/Sub in combination with Google Cloud Functions. Is there any best-practice setup for our problem on Google Cloud?
Google Cloud currently offers only one workflow manager, Cloud Composer (based on the Apache Airflow project); I'm not counting the AI Platform workflow manager (AI Pipelines). This managed solution lets you do the same things you do today with Celery:
- An event occurs
- A Cloud Function is called to process the event
- The Cloud Function triggers a DAG (Directed Acyclic Graph, a workflow in Airflow)
- A step in the DAG runs many subprocesses (Cloud Functions, Cloud Run, anything else), waits for them to finish, and continues to the next step... (see the sketch below)
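A minimal sketch of such a fan-out/fan-in DAG, assuming Airflow 2.x import paths and hypothetical callables (in your case each callable would invoke a Cloud Function or Cloud Run service):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

# hypothetical callables; each fan-out task would call a Cloud Function / Cloud Run service
def run_subtask(index, **kwargs):
    print(f"processing chunk {index}")

def gather_results(**kwargs):
    print("all chunks done, starting the next step")

with DAG("batch_workflow", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    fan_out = [
        PythonOperator(
            task_id=f"process_{i}",
            python_callable=run_subtask,
            op_kwargs={"index": i},
        )
        for i in range(10)
    ]
    gather = PythonOperator(task_id="gather_results", python_callable=gather_results)

    # the fan-out tasks run in parallel; 'gather' only runs once they all succeed
    fan_out >> gather
```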
Two warnings:
- Composer is expensive (about $400 per month for the minimal config)
- DAGs are acyclic: no loops are allowed
Note: a new workflow product should come to GCP. There is no ETA for now, and at the beginning parallelism won't be managed. IMO this solution is the right one for you, but not in the short term, maybe in 12 months.
About the message queue itself, you can use Pub/Sub, which is very efficient and affordable.
Alternative
You can build your own system following this process:
- An event occurs
- A Cloud Function is called to process the event
- The Cloud Function creates as many Pub/Sub messages as there are batched tasks
- For each message generated, you write an entry into Firestore with the initial event and the messageId
- The generated messages are consumed (by a Cloud Function, Cloud Run, or anything else) and, at the end of the processing, the Firestore entry is updated to say that the subprocess has been completed
- You plug a Cloud Function onto the Firestore onWrite event. The function checks whether all the subprocesses for an initial event are completed. If so, it goes to the next step... (a sketch follows this list)
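A rough sketch of the two Cloud Functions involved, assuming a Firestore layout of `batches/{batch_id}/tasks/{message_id}` and hypothetical project, topic, and helper names (adapt the payload shape and the "next step" logic to your case):

```python
import json
from google.cloud import firestore, pubsub_v1

# hypothetical project and topic names
PROJECT = "my-project"
TOPIC = "batch-tasks"

db = firestore.Client()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)

def fan_out(event, context):
    """Triggered by the initial event: publish one Pub/Sub message per batched
    task and record each one in Firestore under the batch (assumed payload shape)."""
    batch_id = event["batch_id"]
    for task in event["tasks"]:
        future = publisher.publish(topic_path, json.dumps(task).encode("utf-8"))
        message_id = future.result()
        db.collection("batches").document(batch_id) \
          .collection("tasks").document(message_id) \
          .set({"status": "pending", "task": task})

def on_task_write(data, context):
    """Firestore onWrite-triggered function: whenever a task document changes,
    check whether every task of its batch is completed."""
    # context.resource ends with .../documents/batches/<batch_id>/tasks/<message_id>
    doc_path = context.resource.split("/documents/")[1]
    batch_id = doc_path.split("/")[1]

    tasks = db.collection("batches").document(batch_id).collection("tasks").stream()
    statuses = [t.to_dict().get("status") for t in tasks]

    if statuses and all(s == "completed" for s in statuses):
        # all subprocesses are done: go to the next step (hypothetical helper)
        start_next_step(batch_id)

def start_next_step(batch_id):
    pass  # e.g. publish to another topic or call the App's callback endpoint
```

The workers that consume the Pub/Sub messages simply set their task document's `status` to `completed` when they finish, which is what triggers the completion check above.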
We have implemented a similar workflow in my company, but it's not easy to maintain and debug when a problem occurs. Otherwise, it works great.