Search code examples
azure-data-explorerkql

Kusto shuffle strategy behavior with nested summarize/join


While improving the performance of a Kusto query, I came across the shuffle strategy for join/summarize. I can clearly see the performance benefits of this strategy for my query, which has high cardinality for the join/summarize key.

While reading the shuffle query Kusto documentation, it seemed that the strategy will be ignored when there are nested shuffle operators.

When the shuffled operator has other shuffle-able operators, like summarize or join, the query becomes more complex and then hint.strategy=shuffle won't be applied.

My query uses nested summarize and join (with shuffle) but I also clearly see performance gains. My query pattern:

Table1
| summarize hint.strategy=shuffle arg_max(Timestamp) by Device, Interface
| join hint.strategy=shuffle (Table2) on Device, Interface

Does a query like this benefit from shuffling?

Also, does the Kusto query planner avoid any problematic shuffle if present always? Basically I wanted to rest assured that there might only be perf issues with a wrongly used/authored shuffle and not data issues.


Solution

  • Please note that the article of shuffle query suggests to use hint.shufflekey in case you have nested summarize/join operators but it requires that the nested summarize/join operators have the same group-by/join key.

    So in your example above, apply the following (I'm assuming that Device has a high cardinality (and you can remove/keep the shuffle strategy from the summarize, keeping/removing it will be the same behavior as long as you specify the shuffle key on the join which wraps this summarize):

    Table1
    | summarize arg_max(Timestamp) by Device, Interface
    | join hint.shufflekey=Device (Table2) on Device, Interface