design-patterns architecture saas rate-limiting fault-tolerance

Rate Limiting and Pricing in Distributed Systems


I'm trying to build an application that relies on several 3rd party integrations. It will have a UI and corresponding CRUD to define a "job", which will most likely be enqueued and processed at scale by workers that execute the business logic for the job definition in their own processes.

The first question: what is the best way to deal with 3rd party rate limits? For example, ChatGPT allows a limited number of requests/tokens per minute, but each consuming worker can call the completion feature several times if the job definition requires it, causing my GPT instance to be throttled. I really can't wrap my mind around how to navigate this without potentially compromising the order of the processed events or re-queueing failed events ad nauseam until they pass through the 3rd party successfully, which may or may not ever happen depending on the load.

The other, related question: what's the best way to price your application when some API keys are flat rate per month and others are usage-based, billed by number of requests or number of tokens, as in the case of ChatGPT?


Solution

  • I'm unfamiliar with the ChatGPT API (and its restrictions), so I'll try to answer your first question in a more general fashion.

    Before we delve into the details let's clarify some concepts:

    • Rate limiting: The number of requests allowed per a predefined time frame

    • Burst limiting: The maximum number of concurrent requests allowed per a predefined short time frame

    • Throttling: Rate limiting applied per scope (per service, per user, per whatever)

    • Load shedding: Service discards some incoming requests

    • Back pressure: Service informs clients to slow down

    • Debouncing: Client holds back a request (or group of requests) for a predefined period before issuing it (them)

    • Rate gating: A debouncing implementation that uses queues and timers

    Service-side

    Usually we can categorize the actions a service can take in case of request flooding as proactive or reactive. Proactive means the service tries to prevent being overloaded by detecting warning signs early and taking "counteraction" against overloading. Reactive mechanisms, on the other hand, try to mitigate the problem by handling flooding that is already occurring.

    Rate limiting, Burst limiting and Throttling can be considered proactive because they try to prevent overflooding by specifying maximum resource usage. Load shedding and Back pressure, on the other hand, can be considered reactive because they handle the situation by taking some action to reduce the number of incoming requests. Load shedding acts without requiring anything from the clients, while Back pressure requires action from the clients.
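
    To make the proactive side more concrete, here is a minimal sketch of a token bucket, which is one common way (not the only one) to enforce a rate limit together with a burst limit. The class name, the parameter names and the injectable `clock` are assumptions made for this illustration.

```python
import time


class TokenBucket:
    """Token bucket sketch: `rate` tokens refill per second, up to `capacity`.

    `rate` plays the role of the rate limit and `capacity` the burst limit.
    All names here are illustrative assumptions, not a specific library's API.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behaviour can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self, cost=1):
        """Return True if `cost` tokens were available and consumed, else False."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would reject (or delay) the request
```

    Keeping one bucket per user or per API key, for example in a dictionary, would turn this into throttling in the sense defined above. The `cost` parameter could also stand for tokens consumed rather than requests, if the limit is expressed that way.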

    Client-side

    Whenever a client sends requests at a greater frequency or volume than the downstream system can handle, it gets informed about this fact.

    • Rate limiting, Burst limiting, Throttling: Requests above the thresholds are rejected. The service response might indicate when the client should re-send the request, in other words when the current sampling period will be over.
    • Load shedding: Some requests are rejected (based on their priority, randomly, or by any other criterion). Services usually do not give a reason or any information about retrying.
    • Back pressure: The service may or may not discard incoming requests (it is an implementation detail), but it informs the client to slow down, either by emitting requests less frequently or by lowering the volume.

    You can look at this problem as a negotiation protocol: how can we (the service and the client) overcome the problem of overflooding? The client needs to react to the service response, otherwise the service will not process its requests.
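
    As an illustration of the client reacting to the service response, the sketch below retries a rejected request a bounded number of times. It assumes the service signals rejection with HTTP status 429 (Too Many Requests) and may send a retry hint in seconds, as the `Retry-After` header does. The `send` callable and its `(status, retry_after, body)` return shape are hypothetical and stand in for whatever client library is actually used.

```python
import random
import time


def call_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `send()` until it stops returning 429 or the attempts run out.

    `send` is a hypothetical callable returning (status, retry_after, body),
    where `retry_after` is the service's hint in seconds, or None.
    """
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # give up instead of retrying forever
        if retry_after is not None:
            # The service told us when the current period is over.
            delay = retry_after
        else:
            # No hint: fall back to exponential backoff with jitter.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
        sleep(delay)
    return status, body
```

    Bounding the attempts is a deliberate choice here: a job that still fails after `max_attempts` can be parked or reported rather than re-queued indefinitely.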

    Rate gating is a proactive client-side mechanism to self-restrict outgoing requests by frequency and/or volume. Rather than waiting for the service to tell us what to do, the client can prevent overflooding by limiting the in-flight requests.
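
    A rate gate along these lines could be sketched as a FIFO queue plus a timer that releases at most one call per interval. This is a single-process illustration under assumed names (`RateGate`, `submit`, `drain`); workers running in separate processes would need to share the gate through some common store, which this sketch does not cover.

```python
import time
from collections import deque


class RateGate:
    """Releases queued calls in FIFO order, at most one per `interval` seconds.

    Illustrative sketch only; the class and method names are assumptions.
    """

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self.pending = deque()
        self.next_slot = clock()  # earliest time the next call may go out

    def submit(self, fn):
        """Queue a zero-argument callable instead of calling the service directly."""
        self.pending.append(fn)

    def drain(self):
        """Run every queued call, sleeping so calls start at least `interval` apart."""
        results = []
        while self.pending:
            wait = self.next_slot - self.clock()
            if wait > 0:
                self.sleep(wait)
            fn = self.pending.popleft()
            self.next_slot = self.clock() + self.interval
            results.append(fn())
        return results
```

    Because calls leave the queue in the order they were submitted, the gate slows processing down without reordering events.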


    I tried to limit the number of concepts in this post because the problem domain is pretty broad. I intentionally did not talk about quotas, bulkhead, circuit breaker, load sharing, etc. If you are interested, there are a couple of great resources on the internet to learn about these as well.