azure-storage azure-table-storage azure-tablequery

Suggestions to improve Azure Table Storage query performance

We have a table in Azure Table Storage which currently has 50,000 items since it is newly implemented.

PartitionKey: DateTime value in form of string
RowKey: numeric value in form of string

We use TableQuery to generate a filter condition. The PartitionKey filter is something like: PartitionKey ge '201801240000000000' && "PartitionKey lt '201806220000000000'"

Unfortunately, we cannot use a RowKey filter because we want data between two dates.

To fetch the data for around a month, it takes around 5 seconds. And to fetch for around 3 months, it takes more time than that.

Though we have a caching strategy in place, fetching the data the first time takes a long time. Just like it takes a long time when the filter on the data changes on the date.

Any suggestions to improve performance would be appreciated.

Solution

As far as I can see from your post, the biggest issue you have is that your query spans multiple partitions in one query. This is not optimal for performance. Based on the below list, you're somewhere between Partition Scan and Table Scan, since you are specifying the partition key, but you're using multiple of them.

A Point Query is the most efficient lookup to use and is recommended to be used for high-volume lookups or lookups requiring lowest latency. Such a query can use the indexes to locate an individual entity very efficiently by specifying both the PartitionKey and RowKey values. For example: $filter=(PartitionKey eq 'Sales') and (RowKey eq '2')

Second best is a Range Query that uses the PartitionKey and filters on a range of RowKey values to return more than one entity. The PartitionKey value identifies a specific partition, and the RowKey values identify a subset of the entities in that partition. For example: $filter=PartitionKey eq 'Sales' and RowKey ge 'S' and RowKey lt 'T'

Third best is a Partition Scan that uses the PartitionKey and filters on another non-key property and that may return more than one entity. The PartitionKey value identifies a specific partition, and the property values select for a subset of the entities in that partition. For example: $filter=PartitionKey eq 'Sales' and LastName eq 'Smith'

A Table Scan does not include the PartitionKey and is very inefficient because it searches all of the partitions that make up your table in turn for any matching entities. It will perform a table scan regardless of whether or not your filter uses the RowKey. For example: $filter=LastName eq 'Jones'

Queries that return multiple entities return them sorted in PartitionKey and RowKey order. To avoid resorting the entities in the client, choose a RowKey that defines the most common sort order.

Source: Azure Storage Table Design Guide: Designing Scalable and Performant Tables

Another very useful article is this one: What PartitionKey and RowKey are for in Windows Azure Table Storage, especially when you look at the this image:

Based on the size and load of a partition, partitions are fanned out across machines. Whenever a partition gets a high load or grows in size, the Windows Azure storage management can kick in and move a partition to another machine:

Edit:
If there are multiple ways you would like to query your data, think about storing them in multiple ways. Especially since storage is cheap, storing data multiple times is not that bad. This way you optimize for read. This is what's known as the Materialized View pattern which can "help support efficient querying and data extraction, and improve application performance".

However, you should keep in mind that this is simple for static data. If you have data that changes around a lot, keeping them in sync when storing it multiple times might become a hassle.