Tags: google-cloud-platform, architecture, publish-subscribe, google-cloud-pubsub

Job to call external API every hour and perform tasks for ~10000 rows individually


I'm currently looking at designing a system that basically needs to run a job every hour, but for roughly 10,000 rows. Each one of those rows would then need a call to an external API plus some other analytics work.

I'm currently trying to work out the best way to achieve this, but I've not had to do anything like this before, so I would appreciate any advice or guidance anyone has. I'm primarily used to GCP, so I've focused my ideas on the tooling available there (this will also most likely be done in a JS/Node environment).

My initial thoughts on the design are as follows.

  • Use Cloud Scheduler to create a job that runs every hour
  • Cloud Scheduler triggers a Cloud Function
  • The Cloud Function retrieves all the necessary rows and publishes a Pub/Sub message to a topic for each row (sketched below).
  • Pub/Sub then triggers another Cloud Function that calls the external API, performs the other tasks for that row, and writes the data back where it needs to go.
  • Fin
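
To make step 3 concrete, here's a rough sketch of what I'm picturing for the first function (the fetchRows helper and the hourly-rows topic name are just placeholders I made up):

    // First function: triggered hourly by Cloud Scheduler via HTTP,
    // fans out one Pub/Sub message per row.
    const { PubSub } = require('@google-cloud/pubsub');
    const topic = new PubSub().topic('hourly-rows'); // placeholder topic name

    exports.fanOut = async (req, res) => {
      const rows = await fetchRows(); // placeholder: load the ~10,000 rows
      // The client library batches the publishes under the hood.
      await Promise.all(rows.map((row) => topic.publishMessage({ json: row })));
      res.status(200).send(`Published ${rows.length} messages`);
    };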

My reasoning for having the first function add items to a queue is that a Cloud Function is obviously limited by execution time and memory, so I didn't think it prudent to have one function try to process all the rows by itself. Is my assumption correct that Pub/Sub would trigger a new instance of the function for each message rather than overwrite the first?

I think I can, in theory, batch some of the external API calls, maybe up to around 20 at a time, so I don't know if that would/should have an impact on the above design.

I obviously want this to cost as little as possible as well, so I don't know whether having an App Engine instance do this would be better. But then I also don't know if I'd run into memory and timeout issues there.

A thought that's occurred to me as I write this is whether I could batch the batches, as it were. Coming from a JS background, I could create all the batch API calls and execute them in a Promise.all() call. Again, I'm not sure of the impact on memory and performance with that, so I guess I would need to test it.
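
For reference, this is roughly what I mean by batching the batches (the /batch endpoint is made up; I'd swap in whatever the external API actually supports):

    // Chunk the rows into groups of 20 and run the batch calls concurrently.
    const chunk = (arr, size) =>
      Array.from({ length: Math.ceil(arr.length / size) }, (_, i) =>
        arr.slice(i * size, (i + 1) * size)
      );

    async function processAll(rows) {
      const batches = chunk(rows, 20);
      // Node 18+ has a global fetch; otherwise use node-fetch or axios.
      const responses = await Promise.all(
        batches.map((batch) =>
          fetch('https://api.example.com/batch', { // hypothetical endpoint
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(batch),
          }).then((r) => r.json())
        )
      );
      return responses.flat();
    }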

Does anyone notice any gaping holes in the above or would there be a better solution to this?

Thanks


Solution

  • The first part of your design is correct (Cloud Scheduler -> Cloud Functions -> messages in Pub/Sub).

    Here, a Cloud Function is called for each message. IMO, it's not the best choice, because a Cloud Functions instance can process only one request at a time. And if you perform an external API call, you waste time for nothing: the instance just waits for the answer, doing nothing.

    A better solution is to use a product that manages concurrent requests, such as Cloud Run or App Engine. With Cloud Run you can have up to 250 concurrent requests per instance, but only 80 with App Engine.

    You will save a lot of money, and time as well, by using this kind of solution.
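
    As an illustration, a minimal Cloud Run service in Node that receives the Pub/Sub push messages could look like this (callExternalApi stands in for your API call and analytics work):

        const express = require('express');
        const app = express();
        app.use(express.json());

        app.post('/', async (req, res) => {
          // Pub/Sub push wraps the payload: the row is base64-encoded in message.data.
          const row = JSON.parse(
            Buffer.from(req.body.message.data, 'base64').toString()
          );
          try {
            await callExternalApi(row); // placeholder for the API call + analytics
            res.status(204).send();     // any 2xx acks the message
          } catch (err) {
            console.error(err);
            res.status(500).send();     // non-2xx: Pub/Sub retries the message
          }
        });

        app.listen(process.env.PORT || 8080);

    Deployed with a concurrency of 250, one instance can sit waiting on up to 250 external API calls at the same time.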


    About the batch processing, I'm not sure I understand.

    • If you can send, in one request to the external API, the 20 values contained in 20 messages, then yes, it's better to batch the requests (create chunks of 20 messages in your first Cloud Function).
    • If you keep sending the requests one by one but use the concurrency capability of the language (Node or Go are very handy for this), there is no real advantage compared to processing the messages one by one.

    In fact, you will reduce the number of calls (but they are really, really cheap) and, on the other hand, increase the complexity of your code. Not sure it's worth it.
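
    If the batching is worth it for your API, the first Cloud Function would publish one message per chunk of 20 rows instead of one per row, roughly like this (the topic name and fetchRows are placeholders):

        const { PubSub } = require('@google-cloud/pubsub');
        const topic = new PubSub().topic('hourly-rows'); // example topic name

        exports.fanOutChunks = async (req, res) => {
          const rows = await fetchRows(); // placeholder: load the rows
          const chunks = [];
          for (let i = 0; i < rows.length; i += 20) {
            chunks.push(rows.slice(i, i + 20));
          }
          // One message per chunk: ~500 messages instead of ~10,000.
          await Promise.all(chunks.map((c) => topic.publishMessage({ json: c })));
          res.status(200).send(`Published ${chunks.length} chunk messages`);
        };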


    EDIT 1

    In fact, Pub/Sub doesn't spawn any Cloud Run instance. A Pub/Sub push subscription only pushes the messages to a URL. The job of Pub/Sub ends there.

    Now, on the Cloud Run side, the service scales according to the HTTP traffic, and the platform chooses to create 1, 2, or more instances to absorb it. In your case, the platform will create a lot of instances (I think about 100), but you pay only while an instance is processing traffic. No request processing, no billing.

    You can also limit the number of parallel instances on Cloud Run with the max instances parameter. With it you can cap the cost, but also the processing capacity.
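
    Both settings are simple deploy flags, for example (the service name and values are only examples):

        gcloud run deploy row-worker \
          --image gcr.io/YOUR_PROJECT/row-worker \
          --concurrency 250 \
          --max-instances 10 \
          --cpu 2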

    Now, about latency, there are of course several different sources.

    1. When you publish the messages to Pub/Sub, there is some latency between the first message created and the 10,000th.
    2. Every time the Cloud Run platform creates a new instance, that instance needs to start and initialize its runtime environment (this is called a cold start). Depending on your language and design, it can take a few hundred ms (about 200-500) or several seconds (with Spring Boot, for example). You could use the min instances feature to keep a number of instances warm and thus limit the number of cold starts; however, for one run per hour, this feature could be too expensive (IMO, I wouldn't recommend it).
    3. On a single instance, if you handle 250 requests concurrently, they have to share the same CPU resources, and some requests will wait for CPU time before being processed. You can increase the number of CPUs to reduce this latency (for example, set 4 vCPUs), but it's the normal behavior of any multi-threaded system.