Tags: azure, azure-cosmosdb, throttling, http-status-code-429, azure-cosmosdb-tables

How to debug/troubleshoot metadata DTU throttling on Azure Cosmos DB (Table API)?


We are using Azure Cosmos DB to store state information for a job-processing pipeline, through the Table API and the corresponding SDK. Recently, we noticed that the system was frequently running into the 429 (Request rate is too large) error. Our transactional DTU utilization was well below the maximum configured on the table, but the Metrics tab showed that the system DTUs consumed by metadata operations, such as enumerating tables, were being exhausted, hence the 429s.

Our initial fix, removing a 'CreateIfNotExists' method call, helped for a while, but recently we have started running into the issue again (though not as frequently as before). It is difficult to debug or troubleshoot this because I could not find documentation on which SDK method calls consume this non-scalable resource. I have enabled logging on our Cosmos DB instance, but I am not sure what to look for in the logs.

Here is the singleton class we use for interfacing with Azure Cosmos DB:

public class CosmosDbTableFacade : ICosmosDbTableFacade
{
        /// <summary>
        /// Initializes a new instance of the <see cref="CosmosDbTableFacade"/> class.
        /// </summary>
        /// <param name="connectionString">
        /// The connection string.
        /// </param>
        public CosmosDbTableFacade(string connectionString)
        {
            var storageAccount = CloudStorageAccount.Parse(connectionString);
            this.CosmosTableClient = storageAccount.CreateCloudTableClient();
        }

        /// <summary>
        /// Gets or sets the Cosmos table client.
        /// </summary>
        public CloudTableClient CosmosTableClient { get; set; }

        /// <summary>
        /// Executes a single table operation against the named table.
        /// </summary>
        /// <param name="tableName">
        /// The table Name.
        /// </param>
        /// <param name="operation">
        /// The operation.
        /// </param>
        /// <returns>
        /// The <see cref="Task"/>.
        /// </returns>
        public Task<TableResult> ExecuteAsync(string tableName, TableOperation operation)
        {
            var table = this.CosmosTableClient.GetTableReference(tableName);
            return table.ExecuteAsync(operation);
        }

        /// <summary>
        /// Executes one segment of a query against the named table.
        /// </summary>
        /// <param name="tableName">
        /// The table name.
        /// </param>
        /// <param name="query">
        /// The query.
        /// </param>
        /// <param name="continuationToken">
        /// The continuation token.
        /// </param>
        /// <returns>
        /// The <see cref="Task"/> which returns the list of entities.
        /// </returns>
        public Task<TableQuerySegment<DynamicTableEntity>> ExecuteQuerySegmentedAsync(string tableName, TableQuery query, TableContinuationToken continuationToken)
        {
            var table = this.CosmosTableClient.GetTableReference(tableName);
            return table.ExecuteQuerySegmentedAsync(query, continuationToken);
        }
}

The following snippet lists the different queries we are using:

public async Task InsertOrMergeEntityAsync<T>(string tableName, T entity)
            where T : TableEntity
{
            var insertOrMergeOperation = TableOperation.InsertOrMerge(entity);
            var result = await this.CosmosDbTableFacade.ExecuteAsync(tableName, insertOrMergeOperation).ConfigureAwait(false);
            ValidateCosmosTableResult(result, "Failed to write to Cosmos Table");
}

public async Task<T> GetEntityAsync<T>(string tableName, string partitionKey, string rowKey)
            where T : TableEntity
{
            var retrieveOperation = TableOperation.Retrieve<T>(partitionKey, rowKey);
            TableResult result = await this.CosmosDbTableFacade.ExecuteAsync(tableName, retrieveOperation).ConfigureAwait(false);
            ValidateCosmosTableResult(result, "Failed to read from Cosmos Table");
            return result.Result as T;
}

public async Task<IEnumerable<T>> GetEntitiesAsync<T>(string tableName, string filterCondition)
            where T : TableEntity
{
            var query = new TableQuery().Where(filterCondition);
            var continuationToken = default(TableContinuationToken);
            var results = new List<T>();
            do
            {
                var currentQueryResults = await this.CosmosDbTableFacade.ExecuteQuerySegmentedAsync(tableName, query, continuationToken).ConfigureAwait(false);
                results.AddRange(currentQueryResults.Select(currentQueryResult =>
                    {
                        var currentEntity = TableEntity.ConvertBack<T>(currentQueryResult.Properties, null);
                        currentEntity.RowKey = currentQueryResult.RowKey;
                        currentEntity.PartitionKey = currentQueryResult.PartitionKey;
                        currentEntity.Timestamp = currentQueryResult.Timestamp;
                        currentEntity.ETag = currentQueryResult.ETag;
                        return currentEntity;
                    }));
                continuationToken = currentQueryResults.ContinuationToken;
            }
            while (continuationToken != null);

            return results;
}

The filter passed to the last method above contains a condition on the partition key and a condition on a custom column.


Solution

  • For anyone running into similar issues: the root cause of the metadata DTU throttling in my case turned out to be the GetTableReference(tableName) call. I found this by deploying a change that moved that line to startup code and then monitoring the DTU utilization. I originally called it on every operation so that I could choose which table to read from or write to at runtime, but since this was consuming metadata DTUs, I changed my code to use a singleton for the table reference instead.
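
If you still need to pick the table at runtime, one possible variation on the singleton approach is to cache one reference per table name, so that GetTableReference runs once per table rather than once per operation. The following is only a minimal sketch under assumptions: the class name CachedTableFacade and the GetTable method are hypothetical and not part of my original code, and it assumes the Microsoft.Azure.Cosmos.Table package that the facade above appears to use. I have not measured this exact variant against the metadata DTU metric.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos.Table;

// Hypothetical variant of the facade from the question: each CloudTable is
// resolved through GetTableReference once per table name and then reused.
public class CachedTableFacade
{
    private readonly CloudTableClient client;

    // Table name -> cached reference. ConcurrentDictionary makes the
    // lookup safe when many pipeline workers share this singleton.
    private readonly ConcurrentDictionary<string, CloudTable> tables =
        new ConcurrentDictionary<string, CloudTable>();

    public CachedTableFacade(string connectionString)
    {
        var storageAccount = CloudStorageAccount.Parse(connectionString);
        this.client = storageAccount.CreateCloudTableClient();
    }

    // Returns the cached reference for the table, adding it to the cache
    // if this name has not been stored yet.
    public CloudTable GetTable(string tableName)
    {
        return this.tables.GetOrAdd(tableName, name => this.client.GetTableReference(name));
    }

    public Task<TableResult> ExecuteAsync(string tableName, TableOperation operation)
    {
        return this.GetTable(tableName).ExecuteAsync(operation);
    }
}
```

Note that ConcurrentDictionary.GetOrAdd may invoke the factory more than once if several threads request the same new table name at the same moment, but only one reference is stored, so the extra calls are limited to that first race. If the set of tables is known up front, resolving them all in startup code avoids even that.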