Search code examples
azure-table-storageazure-tablequery

Do Azure Table Query 'OR' Predicates Cause a Scan?


Assume the following query:

PartitionKey==1 AND (RowKey==A OR RowKey==B)

or even this:

(PartitionKey==1 AND RowKey==A) OR (PartitionKey==2 AND RowKey==B)

Do either of these result in a scan?

I ask because this PDC talk indicates that it does:

https://channel9.msdn.com/Events/PDC/PDC09/SVC09

Specifically at 27:30 the speaker and slide states:

"OR" predicates on keys => no query optimization => results in scan

This is reiterated again at 29:30 and the recommendation is to use parallelized queries. After searching the internet, I also found this thread:

https://social.msdn.microsoft.com/forums/azure/en-us/d7765773-74b8-4860-b07c-b9731a2210c7/performance-of-range-queries-on-partition-keys-and-row-keys

"OR" in queries is not optimized at present but as I mentioned above, we have had feature requests for that.

Now all of this information is 7-8 years old and maybe it has changed. The previous link indicates it was a feature request and today it may have happened. Has it?

If it does result in a scan, why? Given the unique keys are known for both desired elements, what is the technical reason for a scan on one or more partitions?

Consider a scenario where one is trying to fetch hundreds/thousands of entities where only the Partition+RowKey of each entity is known in advance. Is it faster to do an individual query to fetch each entity in parallel, or use an 'OR' predicate to batch multiple per call to minimize the number of parallel queries?

Final question - Are there any performance implications when fetching multiple entities across different partitions in a single query (2nd query example at top of question)? Or is it better to keep the query in a single partition (1st query example at top of question)? The PDC talk at time 25:20 seems to indicate it is actually better to cause queries to distribute work to multiple partitions so that work can happen in parallel. However the discussion was directly speaking to range queries on partitions, not using the 'OR' predicate on two distinct partitions.

Thanks!

** EDIT **

I decided to test the performance myself. For the test I created a table with one partition and a million entities. The partition key was "0" and the row key is an incrementing value of "0" to "999999".

When running the test I did:

ServicePointManager.DefaultConnectionLimit = 1000;
servicePoint.UseNagleAlgorithm = false;
servicePoint.Expect100Continue = false;

In the first test I constructed a single query which looked like this:

(PK eq '0' and RK eq '0') or (PK eq '0' and RK eq '500') or ... x400

There were a total of 400 'OR' groups. The row key was incremented by 500 to spread out the request a bit.

It took Azure 3:26 to run this query.

Next I tested using a slightly more terse form of the same query:

PK eq '0' and (RK eq '0' or RK eq '500' or ... x400)

Again there were a total of 400 'OR'd expressions. And again the row key was incremented by 500 to spread out the request a bit.

It took Azure 1:19 to run this query. Better, but still completely unacceptable performance.

Finally, I tested with 400 parallel requests using TableOperation.Retrieve<>.

It took Azure on average just 3 seconds to run all of those queries in parallel.

Obviously, never use 'OR' in an expression and always use parallel queries. IMO the performance of 'OR' is unacceptable and Microsoft should just remove it if they will not optimize it.


Solution

  • PartitionKey==1 AND (RowKey==A OR RowKey==B) Do either of these result in a scan?

    Yes, it will result in a scan.

    The azure server will regard this query as partition scan. It will scan the partition equals with '1'.

    (PartitionKey==1 AND RowKey==A) OR (PartitionKey==2 AND RowKey==B) Do either of these result in a scan?

    The azure server will regard this query as table scan. It will scan all the table and find the partition key.

    Consider a scenario where one is trying to fetch hundreds/thousands of entities where only the Partition+RowKey of each entity is known in advance. Is it faster to do an individual query to fetch each entity in parallel, or use an 'OR' predicate to batch multiple per call to minimize the number of parallel queries?

    I suggest you could do an individual query to fetch each entity in parallel. Since the or query will result in table or range scan.

    Are there any performance implications when fetching multiple entities across different partitions in a single query (2nd query example at top of question)?

    As far as I know, across different partitions in a single query will cause table scan, it will very slow.

    Or is it better to keep the query in a single partition (1st query example at top of question)?

    As far as I know, the speed of the table scan is as below:

    Point Query(one partitionkey and row key) > Range Query(one partitionkey and range of rowkey) > Partition Scan(one partitionkey and range of property) > Table Scan (not include the PartitionKey)

    So it is better to keep the query in a single partition.