Search code examples
sqlperformancejoinclickhouse

Clickhouse joining of tables


I have algorythm where I need to join two tables , first table has 40 millions of rows and second one has 300 000. Both tables were created with same ORDER BY so while joining I use SETTINGS join_algorithm = 'full_sorting_merge', max_bytes_before_external_sort = 0; It looks like following:

SELECT A.* , B.*  FROM db1.table1 A  -- 40 millions of rows
left join (select * from db1.table2) B  -- 300 000 rows
on A.col1 = B.col1
SETTINGS join_algorithm = 'full_sorting_merge', max_bytes_before_external_sort = 0; 

Now I need to add additional table db1.table3 to join with db1.table1

SELECT A.* , B.*  FROM db1.table1 A  -- 40 millions of rows
left join (select * from db1.table2) B  -- 300 000 rows
on A.col1 = B.col1
left join (select * from db1.table3) C  -- 400 000 rows
on A.col1 = C.col1

Where should I specify settings for join algrorythm ? After every join ? Or only in the end of query ? If I specify only single SETTINGS in the query end will be it propogated on both joins ? Is there ability to specify different algrorythm for every join in query ? Pls help


Solution

  • This way is acceptable for me :

    select 
    D*,C.* from 
    (select * from db1.table1 A left join db1.table2 B on A.col1 = B.col1 Settings  
    SETTINGS join_algorithm = 'full_sorting_merge', max_bytes_before_external_sort = 0;
    ) D
    left join 
    db1.table3 C 
    ON D.col1 = C.col1
    

    Main goal was to not destroy fast loading of join result of A and B table while adding new table C , that is what I mentioned in question. Before of adding table C I used join_algorithm = 'full_sorting_merge' between A and B table because it reduced execution time of select significantly and I had ability specify ORDER BY while table A and B creation to make them sorted in the same order. When I tried to use 'full_sorting_merge' inside single operation like SELECT * from A join B join C Settings join_algorithm = 'full_sorting_merge' then I got memory issue problem ( which I had not while just joined A and B table). Using the way that I specified I don't get memory problems ( executed query more than 10 times and got same execution timings) and execution time of that way is just slightly bigger that original query where I joined A and B table.

    Just want to slighly change row count that I mentioned

    Table A 43 683 753
    Table B    334 194
    Table C        794
    

    I know that it is better to understand clickhouse internals more deeper to fully undestand nature of problem and memory consumption but just for fast fix maybe it will be usefull to someone