I have algorythm where I need to join two tables , first table has 40 millions of rows and second one has 300 000. Both tables were created with same ORDER BY so while joining I use SETTINGS join_algorithm = 'full_sorting_merge', max_bytes_before_external_sort = 0; It looks like following:
SELECT A.* , B.* FROM db1.table1 A -- 40 millions of rows
left join (select * from db1.table2) B -- 300 000 rows
on A.col1 = B.col1
SETTINGS join_algorithm = 'full_sorting_merge', max_bytes_before_external_sort = 0;
Now I need to add additional table db1.table3 to join with db1.table1
SELECT A.* , B.* FROM db1.table1 A -- 40 millions of rows
left join (select * from db1.table2) B -- 300 000 rows
on A.col1 = B.col1
left join (select * from db1.table3) C -- 400 000 rows
on A.col1 = C.col1
Where should I specify settings for join algrorythm ? After every join ? Or only in the end of query ? If I specify only single SETTINGS in the query end will be it propogated on both joins ? Is there ability to specify different algrorythm for every join in query ? Pls help
This way is acceptable for me :
select
D*,C.* from
(select * from db1.table1 A left join db1.table2 B on A.col1 = B.col1 Settings
SETTINGS join_algorithm = 'full_sorting_merge', max_bytes_before_external_sort = 0;
) D
left join
db1.table3 C
ON D.col1 = C.col1
Main goal was to not destroy fast loading of join result of A and B table while adding new table C , that is what I mentioned in question.
Before of adding table C I used join_algorithm = 'full_sorting_merge' between A and B table because it reduced execution time of select significantly and I had ability specify ORDER BY while table A and B creation to make them sorted in the same order. When I tried to use 'full_sorting_merge' inside single operation like SELECT * from A join B join C Settings join_algorithm = 'full_sorting_merge'
then I got memory issue problem ( which I had not while just joined A and B table). Using the way that I specified I don't get memory problems ( executed query more than 10 times and got same execution timings)
and execution time of that way is just slightly bigger that original query where I joined A and B table.
Just want to slighly change row count that I mentioned
Table A 43 683 753
Table B 334 194
Table C 794
I know that it is better to understand clickhouse internals more deeper to fully undestand nature of problem and memory consumption but just for fast fix maybe it will be usefull to someone