Search code examples
sql-serverazuremodelingazure-synapseuniqueidentifier

Use a uniqueidentifier column for hash partitioning


I have 2 fact tables: fact patient and fact production

Both use several columns with uniqueidentifier, I would like to know if it is a good practice and recommended to use this type of column as hash distribution for the tables (currently they are round-robin).

Expect know what column to use.


Solution

    • You should choose distribution column(s) in such a way that column(s) have many unique values and data are distributed evenly in all distributions.

    • You can use the below query to check the number of rows that are stored in each distribution for the chosen distribution column. Make sure the data is evenly distributed.

    -- Find data skew for a distributed table
    DBCC PDW_SHOWSPACEUSED('dbo.FactPatient');
    
    • It is recommended to have zero nulls or few null values in the distribution column. Also, that column should not be a date column.

    • The distribution column should be the column that is used mostly for joins, aggregation like group by, having. Columns which are used in where should not be chosen as the distribution column. It is also recommended to choose a distribution column that minimizes data movement.

    • It is not always possible to have both evenly distributed data and minimized data movement. Thus, find the right balance between minimizing data skew and data movement in choosing the distribution column.

    Reference: Distributed tables design guidance - Azure Synapse Analytics | Microsoft Learn