Our team is trying to convert our round-robin tables in Azure Synapse to hash-distributed tables for performance reasons. All of our current round-robin tables are partitioned on a datetime field, say, loaddate. For hash distribution, though, we'll be using a different column for each table depending on how its data is used, of course.

Before doing this, I want to understand whether there is any impact of partitioning the tables on a standard field that exists in all of them, while the HASH distribution itself happens on a different column for each of those tables.

Can you please help me with some insights into this scenario?
Round robin: a round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random.
Hash: a hash-distributed table distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.
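Both options are declared in the table DDL of a Synapse dedicated SQL pool. A minimal sketch, using the example table below (the schema and table names are just illustrative):

```sql
-- Round robin: rows are assigned to the distributions at random.
CREATE TABLE dbo.sales_rr
(
    loaddate DATE NOT NULL,
    Item     INT  NOT NULL,
    qte      INT  NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN
);

-- Hash: every row with the same Item value lands in the same distribution.
CREATE TABLE dbo.sales_hash
(
    loaddate DATE NOT NULL,
    Item     INT  NOT NULL,
    qte      INT  NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(Item)
);
```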
To explain my point of view, here is an example:
table sales:
| loaddate | Item | qte |
|---|---|---|
| 04/07/2022 | 1 | 1 |
| 04/07/2022 | 1 | 2 |
| 04/07/2022 | 1 | 3 |
| 04/07/2022 | 1 | 2 |
| 04/07/2022 | 1 | 3 |
| 04/07/2022 | 2 | 1 |
| 04/07/2022 | 2 | 2 |
| 04/07/2022 | 2 | 3 |
| 04/07/2022 | 2 | 2 |
| 04/07/2022 | 2 | 3 |
Case 1: splitting the tables on a standard field that exists in all the tables. For example, if we split the data on loaddate into 2 partitions, we can get a result like this (note that all rows here share the same loaddate, so the split between the two partitions is effectively arbitrary, just like round robin):
partition 1:

| loaddate | Item | qte |
|---|---|---|
| 04/07/2022 | 1 | 2 |
| 04/07/2022 | 1 | 3 |
| 04/07/2022 | 2 | 2 |
| 04/07/2022 | 2 | 3 |
| 04/07/2022 | 2 | 3 |
partition 2:

| loaddate | Item | qte |
|---|---|---|
| 04/07/2022 | 1 | 1 |
| 04/07/2022 | 1 | 2 |
| 04/07/2022 | 1 | 3 |
| 04/07/2022 | 2 | 1 |
| 04/07/2022 | 2 | 2 |
So in this case, if we want to run `SELECT Item, SUM(qte) FROM sales GROUP BY Item`, neither partition holds all of the data for any one Item, so a lot of data movement and transfer between them is needed to compute the result. Here it would be better to split on Item.
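In a Synapse dedicated SQL pool you can see this data movement without running the query: `EXPLAIN` returns the distributed plan as XML, and a shuffle-move step in it means rows are being redistributed between distributions for the aggregate. A sketch, assuming `dbo.sales` is distributed on a column other than Item:

```sql
-- The returned plan XML will contain a SHUFFLE_MOVE operation,
-- because rows for the same Item live in different distributions.
EXPLAIN
SELECT Item, SUM(qte) AS total_qte
FROM dbo.sales
GROUP BY Item;
```

If the table were hash-distributed on Item, that shuffle step would disappear from the plan.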
Case 2: HASH distribution happens on a different column for each of those tables. For the example table, we can split on the column Item, and we will get the two partitions below:
partition 1:

| loaddate | Item | qte |
|---|---|---|
| 04/07/2022 | 1 | 1 |
| 04/07/2022 | 1 | 2 |
| 04/07/2022 | 1 | 3 |
| 04/07/2022 | 1 | 2 |
| 04/07/2022 | 1 | 3 |
partition 2:

| loaddate | Item | qte |
|---|---|---|
| 04/07/2022 | 2 | 1 |
| 04/07/2022 | 2 | 2 |
| 04/07/2022 | 2 | 3 |
| 04/07/2022 | 2 | 2 |
| 04/07/2022 | 2 | 3 |
In this case we can execute `SELECT Item, SUM(qte) FROM sales GROUP BY Item` directly, and it will be more optimised, because each partition already holds all of the rows for its Items.

So it's best to choose the distribution column based on the SQL queries you will run against the table.
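Note that a Synapse table's distribution cannot be changed in place; the usual way to switch from round robin to hash is to recreate the table with CTAS and then swap names. A sketch, assuming a table `dbo.sales` that you also want to keep partitioned on loaddate (the partition boundary values are just placeholders):

```sql
-- Recreate the table hash-distributed on Item, keeping the loaddate partitioning.
CREATE TABLE dbo.sales_new
WITH
(
    DISTRIBUTION = HASH(Item),
    PARTITION (loaddate RANGE RIGHT FOR VALUES ('2022-01-01', '2022-07-01'))
)
AS
SELECT * FROM dbo.sales;

-- Swap the new table in; drop dbo.sales_old once validated.
RENAME OBJECT dbo.sales TO sales_old;
RENAME OBJECT dbo.sales_new TO sales;
```

Partitioning on loaddate and distributing on Item are independent: partitioning slices each of the distributions by date (useful for loading and pruning), while the hash column decides which distribution a row lives in.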
Note: in the case of a data warehouse, it is preferable to use REPLICATE distribution for the dimension tables; this will speed up join operations.
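A replicated table keeps a full copy of the data on every Compute node, so joins from a large fact table to a small dimension need no data movement. A minimal sketch with a hypothetical dimension table:

```sql
-- Small dimension table, fully copied to every Compute node.
-- Joins against it avoid shuffle/broadcast moves at query time.
CREATE TABLE dbo.DimItem
(
    Item     INT          NOT NULL,
    ItemName NVARCHAR(50) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE
);
```

Replication is only a good fit for small, rarely updated tables, since every change has to be propagated to all nodes.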