While improving the performance of a Kusto query, I came across the shuffle strategy for join/summarize. I can clearly see the performance benefits of this strategy for my query, which has high cardinality for the join/summarize key.
While reading the shuffle query Kusto documentation, it seemed that the strategy will be ignored when there are nested shuffle operators.
When the shuffled operator has other shuffle-able operators, like summarize or join, the query becomes more complex and then hint.strategy=shuffle won't be applied.
My query uses nested summarize and join (with shuffle) but I also clearly see performance gains. My query pattern:
Table1
| summarize hint.strategy=shuffle arg_max(Timestamp) by Device, Interface
| join hint.strategy=shuffle (Table2) on Device, Interface
Does a query like this benefit from shuffling?
Also, does the Kusto query planner avoid any problematic shuffle if present always? Basically I wanted to rest assured that there might only be perf issues with a wrongly used/authored shuffle and not data issues.
Please note that the article of shuffle query suggests to use hint.shufflekey
in case you have nested summarize/join operators but it requires that the nested summarize/join operators have the same group-by/join key.
So in your example above, apply the following (I'm assuming that Device has a high cardinality (and you can remove/keep the shuffle strategy from the summarize, keeping/removing it will be the same behavior as long as you specify the shuffle key on the join which wraps this summarize):
Table1
| summarize arg_max(Timestamp) by Device, Interface
| join hint.shufflekey=Device (Table2) on Device, Interface