Search code examples
sqlsql-server-2016

SCD Type 2 - Handling Intraday changes?


I have a merge statement that builds my SCD type 2 table each night. This table must house all historical changes made in the source system and create a new row with the date from/date to columns populated along with the "islatest" flag. I have come across an issue today that I am not really sure how to handle.

There looks to have been multiple changes to the source table within a 24 hour period.

    ID      Code         PAN     EnterDate   Cost     Created
  16155 1012401593331   ENRD    2015-11-05  7706.3  2021-08-17 14:34
  16155 1012401593331   ENRD    2015-11-05  8584.4  2021-08-17 16:33

I use a basic merge statement to identify my changes however what would be the best approach to ensure all changes get picked up correctly? The above is giving me an error as it's trying to insert/update multiple rows with the same value

DECLARE @DateNow DATETIME = Getdate()
IF Object_id('tempdb..#meteridinsert') IS NOT NULL
DROP TABLE #meteridinsert;

CREATE TABLE #meteridinsert
             (
                          meterid INT,
                          change  VARCHAR(10)
             );
             
             
             MERGE
INTO         [DIM].[Meters] AS target
using        stg_meters     AS source
ON target.[ID] = source.[ID]
AND          target.latest=1
WHEN matched THEN
UPDATE
SET              target.islatest = 0,
                 target.todate = @Datenow
WHEN NOT matched BY target THEN
INSERT
       (
              id,
              code,
              pan,
              enterdate,
              cost,
              created,
              [FromDate] ,
              [ToDate] ,
              [IsLatest]
       )
       VALUES
       (
              source.id,
              source.code ,
              source.pan ,
              source.enterdate ,
              source.cost ,
              source.created ,
              @Datenow ,
              NULL ,
              1
       )
       output source.id,
       $action
INTO   #meteridinsert;INSERT INTO [DIM].[Meters]
            (
                        [id] ,
                        [code] ,
                        [pan] ,
                        [enterdate] ,
                        [cost] ,
                        [created] ,
                        [FromDate] ,
                        [ToDate] ,
                        [IsLatest]
            )
SELECT     ([id] ,[code] ,[pan] ,[enterdate] ,[cost] ,[created] , @DateNow  ,NULL ,1 FROM stg_meters a
INNER JOIN #meteridinsert cid
ON         a.id = cid.meterid
AND        cid.change = 'UPDATE'

Solution

  • Maybe you can do it using merge statement, but I would prefer to use typicall update and insert approach in order to make it easier to understand (also I am not sure that merge allows you to use the same source record for update and insert...)

    First of all I create the table dimscd2 to represent your dimension table

    create table dimscd2 
    (naturalkey int, descr varchar(100), startdate datetime, enddate datetime)
    

    And then I insert some records...

    insert into dimscd2 values
    (1,'A','2019-01-12 00:00:00.000', '2020-01-01 00:00:00.000'),
    (1,'B','2020-01-01 00:00:00.000', NULL)
    

    As you can see, the "current" is the one with descr='B' because it has an enddate NULL (I do recommend you to use surrogate keys for each record... This is just an incremental key for each record of your dimension, and the fact table must be linked with this surrogate key in order to reflect the status of the fact in the moment when happened).

    Then, I have created some dummy data to represent the source data with the changes for the same natural key

    -- new data (src_data)
    select 1 as naturalkey,'C' as descr, cast('2020-01-02 00:00:00.000' as datetime) as dt into src_data
    union all
    select 1 as naturalkey,'D' as descr, cast('2020-01-03 00:00:00.000' as datetime) as dt
    

    After that, I have created a temp table (##tmp) with this query to set the enddate for each record:

    -- tmp table
    select naturalkey, descr, dt, 
      lead(dt,1,0) over (partition by naturalkey order by dt) enddate,
      row_number() over (partition by naturalkey order by dt) rn
    into ##tmp  
    from src_data 
    

    lead_row_number

    The LEAD function takes the next start date for the same natural key, ordered by date (dt). The ROW_NUMBER marks with 1 the oldest record in the source data for the natural key in the dimension.

    Then, I proceed to close the "current" record using update

    update d
    set enddate = t.dt
    from dimscd2 d
      join ##tmp t
        on d.naturalkey = t.naturalkey
       and d.enddate is null
       and t.rn = 1
    

    And finally I add the new source data to the dimension with insert

    insert into dimscd2
    select naturalkey, descr, dt, 
      case enddate when '1900-00-00' then null else enddate end 
    from ##tmp
    

    Final result is obtained with the query:

    select * from dimscd2
    

    output

    You can test on this db<>fiddle