I have two dataframes(from a delta lake table) that do a left join via an id column.
sd1, sd2
%sql
select
a.columnA,
b.columnB,
from sd1 a
left outer join sd2 b
on a.id = b.id
The problem is that my query takes a long time, looking for ways to improve the results I have found OPTIMIZE ZORDER BY
Youtube video
according to the video seems to be useful when ordering columns if they are going to be part of the where condition`.
But since the two dataframes use the id in the join condition, could it be interesting to order that column?
spark.sql(f'OPTIMIZE delta.`{sd1_delta_table_path}` ZORDER BY (id)')
the logic that follows in my head is that if we first order that column then it will take less time to look for them to make the match. Is this correct ?
Thanks ind advance
OPTIMIZE ZORDER
may help a bit by placing related data together, but it's usefulness may depend on the data type used for ID column. OPTIMIZE ZORDER
relies on the data skipping functionality that just gives you min & max statistics, but may not be useful when you have big ranges in your joins.
You can also tune a file sizes, to avoid scanning of too many smaller files.
But from my personal experience, for joins, bloom filters give better performance because they allow to skip files more efficiently than data skipping. Just build bloom filter on the ID column...