sql join amazon-redshift query-performance

Redshift tuning join

I'm trying to join two tables in redshift. One big and one small. The join is by their id, and I've distributed the big one on the cluster by the column (used for the join) and I used as sortkey this column too. The small one table I have distributed whole on all nodes and used sortkey with the column used for the join.

Example:

create table table_small diststyle all SORTKEY(id) as select * from another_small_table;

create table big_table distkey (id) diststyle key SORTKEY(id) as SELECT * from another_big_table;

explain SELECT * FROM big_table big JOIN small_table small ON big.id = small.id;

The query plan say that redshift is doing a Hash join instead of merge join. This is the expected behaviour? I expected merge join.

Solution

According to

https://docs.aws.amazon.com/redshift/latest/dg/c-the-query-plan.html

Merge Join

Typically the fastest join, a merge join is used for inner joins and outer joins. The merge join is not used for full joins. This operator is used when joining tables where the join columns are both distribution keys and sort keys, and when less than 20 percent of the joining tables are unsorted. It reads two sorted tables in order and finds the matching rows. To view the percent of unsorted rows, query the SVV_TABLE_INFO system table.

So - as you do not have distribution key of (id) then this join type will not be used.