Search code examples
sqletlprimary-keysurrogate-keysql-data-warehouse

How to replace primary key with surrogate keys during ETL?


Have a question that is haunting me for some time.

How in practice looks replacing primary keys with surrogate keys during the ETL process? Like what is the workflow - is it just assigning new IDENTITY? If so, how about previous values, how to replace existing business keys with newly created ones?

In my mind a specific workflow looks like below, but I haven't done it in practice yet:

  1. Drop existing PK_Product and FK_Product in DimProduct and FactSales tables.
  2. Set a new IDENTITY column to dimProduct.
  3. Add new column to FactSales with values from newly created IDENTITY column based on join on previous business key.
  4. Drop an old ProductKey columns in both tables.
  5. Add constraints for newly created surrogate IDENTITY keys.
  6. Assign reference between tables for future-coming values.

But please tell me how you do this in your job and correct me, because I think I'm wrong.


Solution

  • Let's take the simplest case where your target dimension is being loaded from a single source system. The basic steps would be:

    1. Take the unique identifier for the source system record - normally either the PK or BK

    2. Use this identifier to lookup the corresponding record in the target dimension - which holds this identifier as well as the SK and other attributes - and return the SK if a record is found in the Dim

    3. If an SK is found then you are going to perform an Update on the Dim using the SK as the primary identifier

      a. You may also need to perform an insert e.g. if the Dim is SCD2

      b. If there have been no changes between the source and target record you may decide not to process the source record

    4. If no SK is found then you will insert a new record into the target Dim, generating a new SK value in one of two main ways:

      a. Using the capabilities of the underlying database, such as sequences, auto-increment columns, etc.

      b. Using the capabilities of your ETL tool e.g. a sequence generator

    These are obviously the logically steps you need to follow. How you actually implement them depends entirely on your ETL/ELT components - so running a merge command in your DB will look very different from an Informatica workflow but "under the covers" both processes are following the same logical steps