Tags: .net, azure-cosmosdb

Handling unsuccessful document upserts / deletes via the CosmosDB v3 Bulk API


I've migrated my application to v3 of the Azure CosmosDB SDK (version 3.43.1), and I'm using the bulk functionality to upload somewhere between 200 and 1000 items.

Using the BulkExecutor and its configuration I was getting a 100% success rate when uploading: all documents were saved.

In v3 I'm getting 429 (request rate too large) responses, and only some of the documents are saved. I'm not sure how I should handle the ones that were not saved.

ClientOptions setup:

var options = new CosmosClientOptions
{
    // Opt in to the v3 bulk execution mode (the successor to the BulkExecutor library)
    AllowBulkExecution = true,
    // Let the SDK retry throttled (429) requests for up to 60 seconds / 19 attempts
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(60),
    MaxRetryAttemptsOnRateLimitedRequests = 19,
};
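
For completeness, these options are passed when constructing the client (accountEndpoint and authKey below are placeholders):

var client = new CosmosClient(accountEndpoint, authKey, options);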

I'm using the BulkOperations wrapper as described in the docs; it's sketched below for context.
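
Paraphrased from the tutorial sample (so exact names and fields may differ slightly), the wrapper and its helpers look roughly like this:

// Assumes: System, System.Collections.Generic, System.Diagnostics,
// System.Linq, System.Threading.Tasks, Microsoft.Azure.Cosmos

public class OperationResponse<T>
{
    public T Item { get; set; }
    public double RequestUnitsConsumed { get; set; }
    public bool IsSuccessful { get; set; }
    public Exception CosmosException { get; set; }
}

public class BulkOperationResponse<T>
{
    public TimeSpan TotalTimeTaken { get; set; }
    public int SuccessfulDocuments { get; set; }
    public double TotalRequestUnitsConsumed { get; set; }
    public IReadOnlyList<(T, Exception)> Failures { get; set; }
}

public class BulkOperations<T>
{
    public readonly List<Task<OperationResponse<T>>> Tasks;

    private readonly Stopwatch stopwatch = Stopwatch.StartNew();

    public BulkOperations(int operationCount)
    {
        Tasks = new List<Task<OperationResponse<T>>>(operationCount);
    }

    // Awaits all captured operations and aggregates timing, RU cost and failures
    public async Task<BulkOperationResponse<T>> ExecuteAsync()
    {
        await Task.WhenAll(Tasks);
        stopwatch.Stop();
        return new BulkOperationResponse<T>
        {
            TotalTimeTaken = stopwatch.Elapsed,
            TotalRequestUnitsConsumed = Tasks.Sum(t => t.Result.RequestUnitsConsumed),
            SuccessfulDocuments = Tasks.Count(t => t.Result.IsSuccessful),
            Failures = Tasks.Where(t => !t.Result.IsSuccessful)
                            .Select(t => (t.Result.Item, t.Result.CosmosException))
                            .ToList()
        };
    }
}

// Wraps a single item operation so a failure is captured instead of thrown,
// meaning one throttled document doesn't abort the whole batch
private static async Task<OperationResponse<T>> CaptureOperationResponse<T>(Task<ItemResponse<T>> task, T document)
{
    try
    {
        var response = await task;
        return new OperationResponse<T>
        {
            Item = document,
            IsSuccessful = true,
            RequestUnitsConsumed = response.RequestCharge
        };
    }
    catch (CosmosException ex)
    {
        return new OperationResponse<T>
        {
            Item = document,
            IsSuccessful = false,
            RequestUnitsConsumed = ex.RequestCharge,
            CosmosException = ex
        };
    }
}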

My main method for bulk uploading looks like this:

public async Task<BulkOperationResponse<TDocument>> ImportAsync(IEnumerable<TDocument> documents)
{
    var bulkOperations = new BulkOperations<TDocument>(documents.Count());

    foreach (var document in documents)
    {
        var createTask = _container.Value.CreateItemAsync(document, new PartitionKey(document.PartitionKey));
        bulkOperations.Tasks.Add(CaptureOperationResponse(createTask, document));
    }

    var response = await bulkOperations.ExecuteAsync();
    return response;
}

After inspecting the BulkOperationResponse<T> I often see that only a chunk of the documents were saved.

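This is roughly how I read the failures out of the response (using the wrapper fields sketched above):

var response = await ImportAsync(documents);
Console.WriteLine($"Saved {response.SuccessfulDocuments} of {documents.Count()} documents, " +
    $"consumed {response.TotalRequestUnitsConsumed} RU in {response.TotalTimeTaken}");

foreach (var (document, exception) in response.Failures)
{
    // Once the SDK exhausts its own retries, throttled operations surface here as 429s
    if (exception is CosmosException ce && ce.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
    {
        Console.WriteLine($"Throttled: {document.Id}");
    }
}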

I have the same issue when trying to bulk delete documents via this method:

public async Task<BulkOperationResponse<TDocument>> DeleteAsync(IEnumerable<TDocument> documents)
{
    var bulkOperations = new BulkOperations<TDocument>(documents.Count());

    foreach (var document in documents)
    {
        var deleteTask = _container.Value.DeleteItemAsync<TDocument>(document.Id, new PartitionKey(document.PartitionKey));
        bulkOperations.Tasks.Add(CaptureOperationResponse(deleteTask, document));
    }

    var response = await bulkOperations.ExecuteAsync();

    return response;
}

I'm using shared (database-level) throughput for 4 containers: 400 RU/s.

How should I retry the failed documents? Is this supposed to be handled by the SDK, or do I need to retry in my own code, e.g. something like the sketch below?
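
By retrying in my own code I mean something like this (RetryFailedAsync is a hypothetical helper built on the Failures list from the wrapper above):

// Hypothetical: re-submit only the failed documents, backing off between passes
public async Task RetryFailedAsync(IEnumerable<TDocument> documents, int maxPasses = 3)
{
    var pending = documents.ToList();
    for (var pass = 0; pass < maxPasses && pending.Count > 0; pass++)
    {
        var response = await ImportAsync(pending);
        pending = response.Failures.Select(f => f.Item1).ToList();
        if (pending.Count > 0)
        {
            // 429 means we exceeded the provisioned RU/s, so wait before trying again
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, pass)));
        }
    }
}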


Solution

  • Thanks for your comments and answers. I analyzed how the code behaves in the previous version vs. v3.

    I noticed that upserting 100 documents with the BulkExecutor took over 2 seconds and consumed over 2,000 RU.


    In v3 it took 7 seconds and consumed almost 2,000 RU, and some of the documents failed.


    So I decided to scale up to a significantly larger throughput before running my code and scale back down after the operation finishes (roughly as in the sketch below). This gave me the desired results.

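    In code, the scale-up/scale-down looks roughly like this (a sketch: _client, the database name, and the temporary 4000 RU/s are placeholders; since my containers share database-level throughput, I change it on the Database object):

    var database = _client.GetDatabase("my-database"); // placeholder name

    // Current provisioned value; returns null if throughput isn't set at this level
    int? originalThroughput = await database.ReadThroughputAsync();

    await database.ReplaceThroughputAsync(4000); // temporary higher value
    try
    {
        var response = await ImportAsync(documents);
    }
    finally
    {
        if (originalThroughput.HasValue)
        {
            // Scale back down so we don't keep paying for the higher throughput
            await database.ReplaceThroughputAsync(originalThroughput.Value);
        }
    }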

    I did some reading and there are other ways to handle it, such as:

    • switch from manual throughput to autoscale
    • switch to a serverless account

    But due to my setup, I decided to stick with manually increasing the throughput for the duration of the operation.