azure-cosmosdb

Throughput Configuration in Cosmos DB


I am planning to switch from Azure SQL Server to Cosmos DB. I am reading around 27-30 million records every day for processing. Here's how I plan to execute things:

  1. Read data from Kafka and store it in Cosmos DB throughout the day.
  2. Read data from Cosmos DB, perform some arithmetic calculations, and save the calculated data back to different containers.

Basically, I have 2 types of JSON files (read from Kafka):

  • Json1 - size is 70 B (Kafka sends this throughout the day, 1-6 times per day)
  • Json2 - size is 1 KB (Kafka sends this once per day)

I need help understanding the required throughput and the choice of throughput mode for this scenario. Please guide.


Solution

  • First, you need to work out approximately how many reads and writes per second will be processed and stored in Cosmos DB at given times of the day. Request units are the “base currency” of Cosmos DB, so you can't even begin sizing without some idea of this.

    You also need to know what your data retention will be once any historic data has been migrated (this drives storage costs).

    Once you have these figures, you can plug them into our capacity calculator to get a reasonable estimate; a back-of-the-envelope example is sketched after this list.

    You can also consult this article for deciding between the standard and autoscale “throughput modes”: https://learn.microsoft.com/azure/cosmos-db/how-to-choose-offer

    Regarding Kafka – exactly how is this being used?

    • If it is being used for event sourcing between Azure SQL DB-backed microservices (or similar), I would recommend using the change feed in Cosmos DB directly (see patterns here); a minimal change-feed sketch also follows this list.
    • If messages are coming from an external source through Kafka, you will want to check out the Kafka connector documentation.
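
As a back-of-the-envelope illustration of the sizing step above (not a substitute for the capacity calculator), here is a rough sketch in Python. The 27-30 million documents/day figure comes from the question; the 3x peak factor and the ~5 RU per small write / ~1 RU per 1 KB point read rules of thumb are assumptions you should validate against the calculator.

```python
# Rough RU/s sizing sketch for ~30M small writes per day.
# Assumptions (not from the question): peaks run ~3x the daily average,
# ~5 RU per <=1 KB write and ~1 RU per 1 KB point read.

DOCS_PER_DAY = 30_000_000
SECONDS_PER_DAY = 24 * 60 * 60

avg_writes_per_sec = DOCS_PER_DAY / SECONDS_PER_DAY    # ~347 writes/s
peak_writes_per_sec = avg_writes_per_sec * 3           # assumed peak factor

RU_PER_WRITE = 5   # rule of thumb for a ~1 KB insert
RU_PER_READ = 1    # rule of thumb for a ~1 KB point read

write_ru = peak_writes_per_sec * RU_PER_WRITE
read_ru = peak_writes_per_sec * RU_PER_READ  # stage 2 re-reads each document once

print(f"Average writes/s: {avg_writes_per_sec:,.0f}")    # ~347
print(f"Peak writes/s   : {peak_writes_per_sec:,.0f}")   # ~1,042
print(f"Rough peak RU/s : {write_ru + read_ru:,.0f}")    # ~6,250
```

A spiky profile like this (a steady trickle plus a large once-a-day batch) is the kind of workload where autoscale throughput often works out cheaper than manually provisioning for the peak, which is what the how-to-choose-offer article above walks through.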
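
If the change-feed route applies, a minimal sketch using the azure-cosmos Python package might look like the following. The endpoint/key environment variables and the database/container names are placeholders, and a production pipeline would persist the continuation token (or use the change feed processor in the .NET/Java SDKs) instead of re-reading from the start on every run.

```python
import os
from azure.cosmos import CosmosClient

# Placeholder endpoint, key, database, and container names.
client = CosmosClient(os.environ["COSMOS_ENDPOINT"], credential=os.environ["COSMOS_KEY"])
database = client.get_database_client("pipeline-db")
raw = database.get_container_client("raw-events")
calculated = database.get_container_client("calculated")

# Read every document written (or changed) since the container was created.
for item in raw.query_items_change_feed(is_start_from_beginning=True):
    item["calculated_value"] = item.get("value", 0) * 2   # stand-in for the real arithmetic
    calculated.upsert_item(item)
```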