I'm quite new to CosmosDb but pretty quickly I encountered a problem I've never had when using MS SQL.
Currently we have an operation during which we update millions of Cosmos DB items in batches of 1000. During this operation I noticed that about 25% of requests in App Insights are shown as 429s. According to Microsoft documentation, 1-5% of requests returning 429 is healthy, and anything beyond that is a problem.
I've tried different methods to lower this number without scaling up throughput, but nothing helped. I'm not sure whether the problem was in my test implementation, since I tried several things and could have made a mistake somewhere.
Test setup:
Batch operation code:
var updateTasks = new List<Task>();

// queriedItems will have 1000 items in each batch
foreach (SomeCosmosbDbModel queriedItem in queriedItems)
{
    queriedItem.SomeProperty = someValue;
    updateTasks.Add(_repository.UpdateAsync(queriedItem));
}

await Task.WhenAll(updateTasks);
Update method code:
public Task UpdateAsync(TModel model)
{
    return Container.ReplaceItemAsync(model,
        model.Id,
        new PartitionKey(model.Partition),
        new ItemRequestOptions { EnableContentResponseOnWrite = false });
}
Here are the things I tried to lower the number of 429s and the results I got:
All of the changes seemed to improve the speed of the operation, but each time it hit the RU limit on the container, and in some cases some of the write operations were not performed at all and an exception was thrown.
Is there something I could do to lower the number of 429s? Or should I double-check some of the things I already tried, in case I made a mistake during the test? Also, is using a custom RetryPolicy recommended for production apps to improve resiliency?
The updates require a certain amount of RUs that you have little control over (apart from your indexing strategy). So to reduce the number of 429s, your only option is to reduce the rate at which you send requests.
If this is the only workload running on the database at that moment in time, I wouldn't worry too much about it throwing 429s, as long as your RetryPolicy retries often enough, and for long enough, that every request eventually succeeds. In theory your retry policy should be able to handle a request that fails near the start and keep retrying while all the other documents are uploaded, so your RetryPolicy should consist of plenty of retries and a timeout that exceeds the maximum time required to upload the batch.
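Note that the Cosmos DB .NET SDK already retries rate-limited requests internally before surfacing a 429 to your code, and you can widen that window when creating the client. A minimal sketch (the attempt count, wait time, and `connectionString` variable are illustrative, not recommendations):

    // Sketch: raise the SDK's built-in 429 retry limits.
    // Tune the numbers so the total retry window exceeds the time
    // needed to upload a full batch.
    var client = new CosmosClient(
        connectionString,
        new CosmosClientOptions
        {
            // How many times a throttled (429) request is retried internally
            MaxRetryAttemptsOnRateLimitedRequests = 20,
            // Upper bound on the cumulative wait time across those retries
            MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(60)
        });

With generous values here, many workloads don't need a custom RetryPolicy on top at all.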
If other processes are using your database as well, it's probably better to look at limiting the number of requests you send, which is more complicated but also doable. Every response contains the RUs it consumed, which allows you to tune your throughput very precisely by adding a Task.Delay(...)
between requests.
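A minimal sketch of that pacing idea, assuming `container`, `items`, and a known `throughputRuPerSecond` budget exist in scope (all three names are assumptions for illustration, not from the question's code):

    // Sketch: pace writes using the RU charge each response reports.
    foreach (var item in items)
    {
        ItemResponse<SomeCosmosbDbModel> response = await container.ReplaceItemAsync(
            item,
            item.Id,
            new PartitionKey(item.Partition),
            new ItemRequestOptions { EnableContentResponseOnWrite = false });

        // RequestCharge is the RU cost of this single operation.
        double charge = response.RequestCharge;

        // Wait long enough that, on average, we stay under the RU/s budget.
        await Task.Delay(TimeSpan.FromSeconds(charge / throughputRuPerSecond));
    }

This serializes the writes, so in practice you'd combine it with a bounded degree of parallelism, but it shows how RequestCharge feeds back into the delay.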
I wouldn't worry too much about the 1-5% guideline. That's mostly aimed at average database usage, not bulk imports.