Search code examples
sqljoinclickhouse

How to instruct ClickHouse to join with final sub query result


Say I have a ClickHouse cluster with 3 shards and I have distributed table and local table on each node. I have two sub queries q1 and q2 from distributed table. Can I instruct ClickHouse to perform join for the final result of q1 and q2?

In my understanding if I write query like below, the join happens on each node and the left table will be from the local table instead of distributed table.

with q1 as (select * from distributed_table ...) q2 as (select * from distributed_table ...) select * from q1 GLOBAL INNER JOIN q2 on(...)

The reason why I ask this is I have a use case where I need detect the sequence of events so I need use the global order of events when I join.


Solution

  • select * from q1 GLOBAL INNER JOIN q2  
    
    1. fetch the result of q2 from all nodes to the initiator at put into temporary table T
    2. resend temporary table T to all nodes of q1
    3. execute join locally of local q1 with T
    4. gather results at the initiator

    select * from (select * from q1) as q1' INNER JOIN q2  
    
    1. fetch the result of q2 from all nodes to the initiator
    2. fetch the result of q1 from all nodes to the initiator
    3. join q1 and q2 at the initiator

    select * from q1 INNER JOIN q2 settings distributed_product_mode='allow'
    
    1. each shard execute and fetch q2 to itself
    2. each shard join local q1 and q2 (from step 1)
    3. gather results at the initiator

    select * from q1 INNER JOIN q2 settings distributed_product_mode='local'
    
    1. each shard join local q1 and local q2 (both tables sharded the same way)
    2. gather results at the initiator