Search code examples
databaseazurepartitioningazure-synapse

What's the impact of having a partition column different from the column used for HASH distribution in Azure Synapse?


Our team is trying to convert our Round Robin tables in Azure Synapse to Hash tables due to performance reasons. But all our current Round Robin columns are partitioned based on a datetime field, say, loaddate.

But when it comes to Hash distribution, we'll be using a different column for each of these tables depending on the data usage, ofcourse.

Before doing this, I want to understand if there will be any impacts of partitioning the tables on a particular standard field existing in all the tables, while the HASH distribution itself happens on a different column for each of those tables.

Can you please help me with some insights in this scenario?


Solution

  • Round Robin : A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random.

    Hash : A hash-distributed table distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.

    To explain my point of view here an example :

    table sales:

    load Date Item qte
    04/07/2022 1 1
    04/07/2022 1 2
    04/07/2022 1 3
    04/07/2022 1 2
    04/07/2022 1 3
    04/07/2022 2 1
    04/07/2022 2 2
    04/07/2022 2 3
    04/07/2022 2 2
    04/07/2022 2 3

    case 1 : partitioning the tables on a particular standard field existing in all the tables. for example if we partition the data using the loaddate in 2 partition we can have a result like this :

    parition 1 :

    load Date Item qte
    04/07/2022 1 2
    04/07/2022 1 3
    04/07/2022 2 2
    04/07/2022 2 3
    04/07/2022 2 3

    partition 2 :

    load Date Item qte
    04/07/2022 1 1
    04/07/2022 1 2
    04/07/2022 1 3
    04/07/2022 2 1
    04/07/2022 2 2

    So in this case if we want to select Item, sum(qte) group by Item none of the parition will have one Item data and there will be a lot of throughput and data transfer to get the result so here it's best to partition with the Item

    case 2 : HASH distribution happens on a different column for each of those tables.

    for the example table we can partition with the column Item and we will get two partitions like below:

    partition 1 :

    load Date Item qte
    04/07/2022 1 1
    04/07/2022 1 2
    04/07/2022 1 3
    04/07/2022 1 2
    04/07/2022 1 3

    parition 2 :

    load Date Item qte
    04/07/2022 2 1
    04/07/2022 2 2
    04/07/2022 2 3
    04/07/2022 2 2
    04/07/2022 2 3

    and in this case we can execute the query directly select Item, sum(qte) group by Item and it will more optimised.

    So it's better to use the column parition depending on the sql request that you will perform.

    Remarque : In case of a datawarehouse it preferable to use Replication as distribution for the Dimensions tables this will enhance the joins operations