Search code examples
sql-servert-sqltriggersazure-sql-databasehistorical-db

Change tracking/history keeping in SQL database using TSQL script in single table


For context I have a simple Azure SQL database and I have a theoretical dimensional model with like 12 dimensions and some fact tables that eventually should function as a DWH.

I am looking to apply historization on this dimenional model. To start simple and apply the concept of historization I want to enable it on one dimension. In my case this is the DimEmployee which looks like this with a row of sample data (it has more columns but for simplicity sake I only take these)

EmpKey EmpName EmpCity DWHDatStart DWHDatEnd IsActive
52 John London 02/02/2020 31/12/9999 Y

Now if lets say John moves to a different city I want this to be tracked so the wanted situation when the value of EmpCity is updated would be

EmpKey EmpName EmpCity DWHDatStart DWHDatEnd IsActive
52 John Amsterdam 08/06/2021 31/12/9999 Y
52 John London 02/02/2020 08/06/2021 N

How would I go about applying this with TSQL. I assume it will have to be a trigger of some sorts but lack the specific knowledge to apply this. I've read about temporal tables and that usually when enabling historization on data you store the historical data in a separate history table. I think because of this models design there can always be expanded on this concept by adding DimEmployeeHist dimensions or whatever but for the first concept this is not needed. I don't know what value it would bring doing it in separate tables other then when you have lot's of history records so you want them all neatly stored in one place to not clutter up your main tables.

Does this have to be done in the create table script or can this be done after the table is created?


Solution

  • Firstly: forget about triggers. Triggers are bad

    This is a braindump on how I generally do this. There are many permutations on this but this should give you an idea.

    This is a row in the target dimension. Of course there are many other rows

    SurrogateKey SourceKey SourceSystem EmpName EmpCity DWHDatStart DWHDatEnd IsActive
    3678 52 HRSystem1 John London 2020-02-20 9999-12-31 Y
    3642 73 HRSystem1 Jim Brisbane 2021-03-18 9999-12-31 Y
    • SurrogateKey is the key that joins to your fact and is unique in the dimension table (and should be enforced with a constraint or index)
    • SourceKey is the key in the source system
    • SourceSystem is a code for whatever system provided this record.
    • SourceKey + SourceSystem is unique (and should be enforced with a constraint or index)

    We load data into our datawarehouse. Typically the first step is to load the input data into a staging table. So say we have this in a staging table:

    EmpKey EmpName EmpCity ActiveRecordSurrogateKey Ignore
    52 John Amsterdam NULL NULL
    73 Jim Brisbane NULL NULL
    7 Jack Texas NULL NULL

    ActiveRecord and Ignore are working columns, they don't come from the source. Every other column does come from the source but knows nothing about the dimension

    There will obviously be many records in the dimension and staging table.

    First exclude all records that are up to date in the dimension.

    UPDATE StagingTable
    SET Ignore = 'Y'
    FROM StagingTable TGT
    INNER JOIN DimensionTable SRC
    ON TGT.EmpKey = SRC.SourceKey
    AND TGT.SourceKey = 'HRSystem1'
    AND TGT.EmpName=SRC.EmpName 
    AND TGT.EmpCity=SRC.EmpCity
    AND SRC.IsActive = 'Y'
    

    We've identified that Jim has not changed and can be ignored

    EmpKey EmpName EmpCity ActiveRecordSurrogateKey Ignore
    52 John Amsterdam NULL NULL
    73 Jim Brisbane NULL Y
    7 Jack Texas NULL NULL

    identify all the records in the staging table that already have an active record in the dimension but have different attributes

    UPDATE StagingTable
    SET ActiveRecordSurrogateKey = SRC.SurrogateKey
    FROM StagingTable TGT
    INNER JOIN DimensionTable SRC
    ON TGT.EmpKey = SRC.SourceKey
    AND TGT.SourceKey = 'HRSystem1'
    AND TGT.IsActive='Y'
    AND (TGT.EmpName<>SRC.EmpName OR TGT.EmpCity<>SRC.EmpCity)
    

    (These two seperate updates could be combined into one if you want.)

    Now our staging table looks like this.

    EmpKey EmpName EmpCity ActiveRecordSurrogateKey Ignore
    52 John Amsterdam 3678 NULL
    73 Jim Brisbane NULL Y
    7 Jack Texas NULL NULL

    Now we have enough info to update the dimension. We can write some more SQL to apply to the dimension based on the helper columns.

    But first, Lets specify a fixed date. This stops weird things happening if this runs before and after midnight. Or you might want to determine this some other way, like an input parameter or data in the source

    DECLARE @Date DATE = GETDATE();
    

    Now we can insert all new records (brand new or changed)

    -- This line inserts new dimension records:
    INSERT INTO DimensionTable (SourceKey,SourceSystem,EmpName,EmpCity, StartDate,EndDate,IsActive)
    SELECT EmpKey,'HRSystem1',EmpName,EmpCity, @Date,'2999-01-01','Y' 
    FROM StagingTable 
    WHERE Ignore IS NULL
    

    Now our dimension looks like this

    SurrogateKey SourceKey SourceSystem EmpName EmpCity DWHDatStart DWHDatEnd IsActive
    3678 52 HRSystem1 John London 2020-02-20 9999-12-31 Y
    3642 73 HRSystem1 Jim Brisbane 2021-03-18 9999-12-31 Y
    3693 7 HRSystem1 Jack Texas 2021-06-09 9999-12-31 Y
    3694 52 HRSystem1 John Amsterdam 2021-06-09 9999-12-31 Y

    Now we end date existing records:

    -- This line end-dates existing records:
    UPDATE DimensionTable
    SET DWEndDate = @Date, Active = 'N'
    FROM DimensionTable TGT
    INNER JOIN StagingTable SRC
    ON TGT.SurrogateKey  = ActiveRecordSurrogateKey
    
    SurrogateKey SourceKey SourceSystem EmpName EmpCity DWHDatStart DWHDatEnd IsActive
    3678 52 HRSystem1 John London 2020-02-20 2021-06-09 N
    3642 73 HRSystem1 Jim Brisbane 2021-03-18 9999-12-31 Y
    3693 52 HRSystem1 Jack Texas 2021-06-09 9999-12-31 Y
    3694 52 HRSystem1 John Amsterdam 2021-06-09 9999-12-31 Y

    So you basically wrap all of those T-SQL statements up in one stored procedure, add some transactions, logging, and error handling.

    CREATE PROC pUpdateDimPerson
    AS
    BEGIN
    
    -- All the code above
    
    END
    

    The stored procedure compares source (staging) and target (dimension) and does all the right things.

    There are a bunch of other things to consider but this gives you an idea.