I'm currently designing a system that needs to run a job every hour over roughly 10,000 rows. Each of those rows would then need to call an external API and do some other bits in terms of analytics.
I'm trying to work out the best way to achieve this, but I've not had to do anything like this before, so I'd appreciate any advice or guidance anyone has. I'm primarily used to GCP, so I've focused my ideas on the tooling available there (this will also most likely be done in a JS/Node environment).
My initial thoughts on the design are as follows.
My reasoning for having the first function add items to a queue is that a Cloud Function is obviously limited by execution time and memory, so I didn't think it prudent to have one function try to process all the rows by itself. My assumption is that Pub/Sub would trigger a new instance of the function for each message rather than overwrite the first?
I think I can, in theory, batch some of the external API calls, maybe up to around 20 at a time, so I don't know whether that would or should have an impact on the above design.
I obviously want this to cost as little as possible as well, so I don't know whether having an App Engine instance do this would be better? But then I also don't know if I'd run into memory and timeout issues there.
A thought that's occurred to me as I write this is whether I could batch the batches, as it were. Coming from a JS background, I could create all the batched API calls and execute them in a Promise.all() call. Again, I'm not sure of the impact on memory and performance with that, so I guess I would need to test it.
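Something like this is what I have in mind, as a rough sketch. `callApi` here is just a stand-in for whatever the real external API client ends up being:

```javascript
// Split the rows into batches, then run each batch's API calls
// concurrently with Promise.all(). Batches run one after another,
// which keeps peak memory and open sockets bounded.

function chunk(rows, size) {
  const chunks = [];
  for (let i = 0; i < rows.length; i += size) {
    chunks.push(rows.slice(i, i + size));
  }
  return chunks;
}

async function processRows(rows, callApi, batchSize = 20) {
  const results = [];
  for (const batch of chunk(rows, batchSize)) {
    // All calls in this batch run concurrently; await gates the next batch.
    const settled = await Promise.all(batch.map((row) => callApi(row)));
    results.push(...settled);
  }
  return results;
}
```

If one call rejecting shouldn't sink the whole batch, Promise.allSettled() could be swapped in so failures can be inspected per row instead.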
Does anyone notice any gaping holes in the above or would there be a better solution to this?
Thanks
The first part of your design is correct (Cloud Scheduler -> Cloud Functions -> messages in Pub/Sub).
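For example, the hourly fan-out function could look roughly like this. It's only a sketch: the row source and the topic name are placeholders, and the publisher is injected so the logic is easy to test; in the real function you'd pass in `publishMessage` from a `@google-cloud/pubsub` Topic:

```javascript
// Hourly fan-out: read the rows, publish one Pub/Sub message per row.
// `fetchRows` and `publish` are injected; both are assumptions here.
async function fanOut(fetchRows, publish) {
  const rows = await fetchRows();   // e.g. read ~10,000 rows from your DB
  let published = 0;
  for (const row of rows) {
    await publish({ json: row });   // one message per row
    published += 1;
  }
  return published;                 // useful for logging/monitoring
}

// Wired up in the real Cloud Function, roughly:
// const { PubSub } = require('@google-cloud/pubsub');
// const topic = new PubSub().topic('row-jobs'); // hypothetical topic name
// exports.hourlyFanOut = (req, res) =>
//   fanOut(loadRowsFromDb, (msg) => topic.publishMessage(msg))
//     .then((n) => res.status(200).send(`published ${n}`));
```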
Here, a Cloud Function is called for each message. IMO, it's not the best choice, because a Cloud Functions instance can process only 1 request at a time. And if you perform an external API call, you will waste time for nothing (you will wait for the answer while doing nothing).
A better solution is to use a product that manages concurrent requests, such as Cloud Run or App Engine. With Cloud Run you can have up to 250 concurrent requests per instance, but only 80 with App Engine.
You will save a lot of money, and also time, by using this kind of solution.
About the batch processing, I'm not sure I understand.
In fact, you would reduce the number of calls (but they're really, really cheap) and, on the other hand, increase the complexity of your code. I'm not sure that's worth it.
EDIT 1
In fact, Pub/Sub doesn't spawn any Cloud Run instance. A Pub/Sub push subscription only pushes the messages to a URL. Pub/Sub's job ends there.
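Concretely, the push delivery is just an HTTP POST whose body wraps the message, with your payload base64-encoded in `message.data`. A minimal sketch of the Cloud Run side could look like this (the decoding follows the documented push format; `processRow` is a placeholder for your API call and analytics):

```javascript
// Decode a Pub/Sub push request body into the original JSON payload.
// Push format: { message: { data: <base64>, messageId, ... }, subscription }.
function decodePushMessage(body) {
  if (!body || !body.message || !body.message.data) {
    throw new Error('invalid Pub/Sub push payload');
  }
  const json = Buffer.from(body.message.data, 'base64').toString('utf8');
  return JSON.parse(json);
}

// Express wiring, roughly:
// const express = require('express');
// const app = express();
// app.use(express.json());
// app.post('/', async (req, res) => {
//   try {
//     const row = decodePushMessage(req.body);
//     await processRow(row);     // your API call + analytics (placeholder)
//     res.status(204).send();    // 2xx acks: Pub/Sub won't redeliver
//   } catch (err) {
//     res.status(500).send();    // non-2xx nacks: Pub/Sub retries
//   }
// });
// app.listen(process.env.PORT || 8080);
```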
Now, on the Cloud Run side, the service scales according to the HTTP traffic, and the platform chooses to create 1, 2, or more instances to absorb it. In your case, the platform will create a lot of instances (I think about 100), but you pay only while the instances process the traffic. No request processing, no billing.
You can also limit the number of parallel instances on Cloud Run with the max instances parameter. With it, you can limit the cost, but also the processing capacity.
Now, about latency, there are of course different sources.