Search code examples
clickhouse

what is the effect of distributed_group_by_no_merge


I know that the distributed node does not combine intermediate results from shards by using distributed_group_by_no_merge.

The following SQL

select sum(xxxxx),xxxxx from (
    select sum(xxxx),xxxx 
    from (
        select count(xxx),xxx 
        from distributed_table group by xxx )  
    group by xxxx SETTINGS distributed_group_by_no_merge = 1
) group by xxxxx

I want to know that which part of sql will be sent to MergeTree node to execute by using distributed_group_by_no_merge? is it?select count(xxx),xxx from distributed_table group by xxx ) group by xxxx SETTINGS distributed_group_by_no_merge = 1

how does the parameter of distributed_group_by_no_merge change the behavior of distributed query?which part of sql execute on MergeTree node and which part of sql execute on distributed node?


Solution

  • distributed_group_by_no_merge-param affects the way how the initiator-node (it is a node which runs distributed query) will form the final result of a distributed query:

    • either by merging aggregated intermediate states coming from shards by itself (it required copying full aggregated intermediate states from shards to initiator-node) [distributed_group_by_no_merge = 0 (default mode)]

    • or get already final result from shards (when each shard merges an intermediate aggregation state on its side and send to initiator-node only the final result). It provides a significant improvement in performance and resource consumption but requires the right selection of the sharding key [distributed_group_by_no_merge = 1]


    I would put distributed_group_by_no_merge at the same level of subquery where defined distributed table to explicitly define your intention and avoid confusion when there are several distributed-subqueries.


    Let's look at the way how to check the differences between the two modes (will use _shard_num-virtual column):

    1. distributed_group_by_no_merge=0
    SELECT
        groupUniqArray(_shard_num) AS shards,
        ..
    FROM table
    WHERE ..
    GROUP BY ..
    SETTINGS distributed_group_by_no_merge = 0
    
    /* Aggregated states were merged into ONE result set on initiator-node.
    ┌─shards────┬─ ..
    │ [2, 1, 3] │  ..
    └───────────┴─ ..
    */
    
    1. distributed_group_by_no_merge=1
    SELECT
        groupUniqArray(_shard_num) AS shards,
        ..
    FROM table
    WHERE ..
    GROUP BY ..
    SETTINGS distributed_group_by_no_merge = 1
    
    /* Get a set of final results (not aggregated states) from each shard. They should be unioned manually.
    ┌─shards─┬─ ..
    │ [2]    │  ..
    │ [1]    │  ..
    │ [3]    │  ..
    └────────┴─ ..
    */
    

    See https://clickhouse.com/docs/en/operations/settings/settings#distributed_group_by_no_merge.

    How to avoid merging high cardinality sub-select aggregations on distributed tables