Tags: .net, azure-cosmosdb

Handling unsuccessful document upserts / deletes via the CosmosDB v3 Bulk API


I've migrated my application to v3 of the Azure CosmosDB SDK (version 3.43.1), and I'm using the bulk functionality to upload somewhere between 200 and 1000 items.

Using the BulkExecutor and its configuration I was getting a 100% success rate when uploading: all documents were saved.

In v3 I'm getting 429 (request rate too large) responses, and only some of the documents are saved. I'm not sure how I should handle the ones that were not saved.

ClientOptions setup:

var options = new CosmosClientOptions
{
    // Opt in to the v3 bulk execution mode (the successor to the BulkExecutor library)
    AllowBulkExecution = true,
    // Let the SDK retry throttled (429) requests for up to 60 seconds / 19 attempts
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(60),
    MaxRetryAttemptsOnRateLimitedRequests = 19,
};
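
For completeness, these options are passed when constructing the client (accountEndpoint and authKey below are placeholders):

var client = new CosmosClient(accountEndpoint, authKey, options);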

I'm using the BulkOperations wrapper as described in the docs; it's sketched below for context.
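
Paraphrased from the tutorial sample (so exact names and fields may differ slightly), the wrapper and its helpers look roughly like this:

// Assumes: System, System.Collections.Generic, System.Diagnostics,
// System.Linq, System.Threading.Tasks, Microsoft.Azure.Cosmos

public class OperationResponse<T>
{
    public T Item { get; set; }
    public double RequestUnitsConsumed { get; set; }
    public bool IsSuccessful { get; set; }
    public Exception CosmosException { get; set; }
}

public class BulkOperationResponse<T>
{
    public TimeSpan TotalTimeTaken { get; set; }
    public int SuccessfulDocuments { get; set; }
    public double TotalRequestUnitsConsumed { get; set; }
    public IReadOnlyList<(T, Exception)> Failures { get; set; }
}

public class BulkOperations<T>
{
    public readonly List<Task<OperationResponse<T>>> Tasks;

    private readonly Stopwatch stopwatch = Stopwatch.StartNew();

    public BulkOperations(int operationCount)
    {
        Tasks = new List<Task<OperationResponse<T>>>(operationCount);
    }

    // Awaits all captured operations and aggregates timing, RU cost and failures
    public async Task<BulkOperationResponse<T>> ExecuteAsync()
    {
        await Task.WhenAll(Tasks);
        stopwatch.Stop();
        return new BulkOperationResponse<T>
        {
            TotalTimeTaken = stopwatch.Elapsed,
            TotalRequestUnitsConsumed = Tasks.Sum(t => t.Result.RequestUnitsConsumed),
            SuccessfulDocuments = Tasks.Count(t => t.Result.IsSuccessful),
            Failures = Tasks.Where(t => !t.Result.IsSuccessful)
                            .Select(t => (t.Result.Item, t.Result.CosmosException))
                            .ToList()
        };
    }
}

// Wraps a single item operation so a failure is captured instead of thrown,
// meaning one throttled document doesn't abort the whole batch
private static async Task<OperationResponse<T>> CaptureOperationResponse<T>(Task<ItemResponse<T>> task, T document)
{
    try
    {
        var response = await task;
        return new OperationResponse<T>
        {
            Item = document,
            IsSuccessful = true,
            RequestUnitsConsumed = response.RequestCharge
        };
    }
    catch (CosmosException ex)
    {
        return new OperationResponse<T>
        {
            Item = document,
            IsSuccessful = false,
            RequestUnitsConsumed = ex.RequestCharge,
            CosmosException = ex
        };
    }
}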

My main method for bulk uploading looks like this:

public async Task<BulkOperationResponse<TDocument>> ImportAsync(IEnumerable<TDocument> documents)
{
    var bulkOperations = new BulkOperations<TDocument>(documents.Count());

    foreach (var document in documents)
    {
        var createTask = _container.Value.CreateItemAsync(document, new PartitionKey(document.PartitionKey));
        bulkOperations.Tasks.Add(CaptureOperationResponse(createTask, document));
    }

    var response = await bulkOperations.ExecuteAsync();
    return response;
}

After inspecting the BulkOperationResponse<T> I often see that only a chunk of the documents were saved.

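This is roughly how I read the failures out of the response (using the wrapper fields sketched above):

var response = await ImportAsync(documents);
Console.WriteLine($"Saved {response.SuccessfulDocuments} of {documents.Count()} documents, " +
    $"consumed {response.TotalRequestUnitsConsumed} RU in {response.TotalTimeTaken}");

foreach (var (document, exception) in response.Failures)
{
    // Once the SDK exhausts its own retries, throttled operations surface here as 429s
    if (exception is CosmosException ce && ce.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
    {
        Console.WriteLine($"Throttled: {document.Id}");
    }
}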

I have the same issue when trying to bulk delete documents via this method:

public async Task<BulkOperationResponse<TDocument>> DeleteAsync(IEnumerable<TDocument> documents)
{
    var bulkOperations = new BulkOperations<TDocument>(documents.Count());

    foreach (var document in documents)
    {
        var deleteTask = _container.Value.DeleteItemAsync<TDocument>(document.Id, new PartitionKey(document.PartitionKey));
        bulkOperations.Tasks.Add(CaptureOperationResponse(deleteTask, document));
    }

    var response = await bulkOperations.ExecuteAsync();

    return response;
}

I'm using shared (database-level) throughput for 4 containers: 400 RU/s.

How should I retry the failed documents? Is this supposed to be handled by the SDK, or do I need to retry in my own code, e.g. something like the sketch below?
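
By retrying in my own code I mean something like this (RetryFailedAsync is a hypothetical helper built on the Failures list from the wrapper above):

// Hypothetical: re-submit only the failed documents, backing off between passes
public async Task RetryFailedAsync(IEnumerable<TDocument> documents, int maxPasses = 3)
{
    var pending = documents.ToList();
    for (var pass = 0; pass < maxPasses && pending.Count > 0; pass++)
    {
        var response = await ImportAsync(pending);
        pending = response.Failures.Select(f => f.Item1).ToList();
        if (pending.Count > 0)
        {
            // 429 means we exceeded the provisioned RU/s, so wait before trying again
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, pass)));
        }
    }
}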


Solution

  • Thanks for your comments and answers. I analyzed how the code behaves in the previous version vs. v3.

    I noticed that upserting 100 documents with the BulkExecutor took over 2 seconds and consumed over 2,000 RU.


    In v3 it took 7 seconds and consumed almost 2,000 RU, and some of the documents failed.


    So I decided to scale up to a significantly larger throughput before running my code and scale back down after the operation finishes (roughly as in the sketch below). This gave me the desired results.

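    In code, the scale-up/scale-down looks roughly like this (a sketch: _client, the database name, and the temporary 4000 RU/s are placeholders; since my containers share database-level throughput, I change it on the Database object):

    var database = _client.GetDatabase("my-database"); // placeholder name

    // Current provisioned value; returns null if throughput isn't set at this level
    int? originalThroughput = await database.ReadThroughputAsync();

    await database.ReplaceThroughputAsync(4000); // temporary higher value
    try
    {
        var response = await ImportAsync(documents);
    }
    finally
    {
        if (originalThroughput.HasValue)
        {
            // Scale back down so we don't keep paying for the higher throughput
            await database.ReplaceThroughputAsync(originalThroughput.Value);
        }
    }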

    I did some reading and there are other ways to handle it, such as:

    • switch from manual throughput to autoscale
    • switch to a serverless account

    But due to my setup, I decided to stick with manually increasing the throughput for the duration of the operation.