I'm quite new to CosmosDb but pretty quickly I encountered a problem I've never had when using MS SQL.
Currently we have an operation during which we update millions of Cosmos DB items in batches of 1000. During this operation I noticed that about 25% of requests in App Insights are shown as 429s. According to Microsoft documentation, 1-5% of requests returning 429 is healthy, and anything beyond that is a problem.
I've tried different methods to lower this number without scaling up throughput, but nothing helped. I'm not sure whether the problem was in my test implementation, since I tried several things and could have made a mistake somewhere.
Test setup:
Batch operation code:
var updateTasks = new List<Task>();

// queriedItems will have 1000 items in each batch
foreach (SomeCosmosbDbModel queriedItem in queriedItems)
{
    queriedItem.SomeProperty = someValue;
    updateTasks.Add(_repository.UpdateAsync(queriedItem));
}

await Task.WhenAll(updateTasks);
Update method code:
public Task UpdateAsync(TModel model)
{
    return Container.ReplaceItemAsync(model,
        model.Id,
        new PartitionKey(model.Partition),
        new ItemRequestOptions { EnableContentResponseOnWrite = false });
}
Here are the things I tried to lower the number of 429s and the results I got:
All of the changes seemed to improve the speed of the operation, but each time it hit the RU limit on the container, and in some cases some of the write operations were not performed at all and an exception was thrown.
Is there something I could do to lower the number of 429s? Or should I double-check some of the things I already tried, in case I made a mistake during the test? Also, is using a custom RetryPolicy recommended for production apps to improve resiliency?
The updates require a certain amount of RUs that you have little control over (apart from your indexing strategy). So to reduce the number of 429s, your only option is to reduce the rate at which you send requests.
If this is the only workload running on the database at that moment in time, I wouldn't worry too much about it throwing 429s, as long as your RetryPolicy retries often enough, and for long enough, that every request eventually succeeds. In theory your retry policy should be able to handle a request that fails near the start and keep retrying while all the other documents are uploaded, so your RetryPolicy should consist of plenty of retries and a timeout that exceeds the maximum time required to upload the batch.
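Note that the Cosmos DB .NET SDK already retries rate-limited requests internally before surfacing a 429 to your code, and you can widen that window when creating the client. A minimal sketch (the attempt count, wait time, and `connectionString` variable are illustrative, not recommendations):

    // Sketch: raise the SDK's built-in 429 retry limits.
    // Tune the numbers so the total retry window exceeds the time
    // needed to upload a full batch.
    var client = new CosmosClient(
        connectionString,
        new CosmosClientOptions
        {
            // How many times a throttled (429) request is retried internally
            MaxRetryAttemptsOnRateLimitedRequests = 20,
            // Upper bound on the cumulative wait time across those retries
            MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(60)
        });

With generous values here, many workloads don't need a custom RetryPolicy on top at all.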
If other processes are using your database as well, it's probably better to look at limiting the number of requests you send, which is more complicated but also doable. Every response contains the RUs it consumed, which allows you to tune your throughput very precisely by adding a Task.Delay(...)
between requests.
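A minimal sketch of that pacing idea, assuming `container`, `items`, and a known `throughputRuPerSecond` budget exist in scope (all three names are assumptions for illustration, not from the question's code):

    // Sketch: pace writes using the RU charge each response reports.
    foreach (var item in items)
    {
        ItemResponse<SomeCosmosbDbModel> response = await container.ReplaceItemAsync(
            item,
            item.Id,
            new PartitionKey(item.Partition),
            new ItemRequestOptions { EnableContentResponseOnWrite = false });

        // RequestCharge is the RU cost of this single operation.
        double charge = response.RequestCharge;

        // Wait long enough that, on average, we stay under the RU/s budget.
        await Task.Delay(TimeSpan.FromSeconds(charge / throughputRuPerSecond));
    }

This serializes the writes, so in practice you'd combine it with a bounded degree of parallelism, but it shows how RequestCharge feeds back into the delay.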
I wouldn't worry too much about the 1-5% guideline. That's mostly aimed at average database usage, not bulk imports.