Search code examples
sqlt-sqlbusiness-intelligencedimension

Merge historical periods of an dimension entity into one


I have a Slowly Changing Dimension type 2 with rows that are identical (besides the Start and End date). How do I write a pretty SQL query to merge rows that identical and have connected time periods?

Current data

+-------------+---------------------+--------------+------------+
| DimensionID | DimensionAttribute  | RowStartDate | RowEndDate |
+-------------+---------------------+--------------+------------+
|           1 | SomeValue           | 2019-01-01   | 2019-01-31 |
|           1 | SomeValue           | 2019-02-01   | 2019-02-28 |
|           1 | AnotherValue        | 2019-03-01   | 2019-03-31 |
|           1 | SomeValue           | 2019-04-01   | 2019-04-30 |
|           1 | SomeValue           | 2019-05-01   | 2019-05-31 |
|           2 | SomethingElse       | 2019-01-01   | 2019-01-31 |
|           2 | SomethingElse       | 2019-02-01   | 2019-02-28 |
|           2 | SomethingElse       | 2019-03-01   | 2019-03-31 |
|           2 | CompletelyDifferent | 2019-04-01   | 2019-04-30 |
|           2 | SomethingElse       | 2019-05-01   | 2019-05-31 |
+-------------+---------------------+--------------+------------+

Result

+-------------+---------------------+--------------+------------+
| DimensionID | DimensionAttribute  | RowStartDate | RowEndDate |
+-------------+---------------------+--------------+------------+
|           1 | SomeValue           | 2019-01-01   | 2019-02-28 |
|           1 | AnotherValue        | 2019-03-01   | 2019-03-31 |
|           1 | SomeValue           | 2019-04-01   | 2019-05-31 |
|           2 | SomethingElse       | 2019-01-01   | 2019-03-31 |
|           2 | CompletelyDifferent | 2019-04-01   | 2019-04-30 |
|           2 | SomethingElse       | 2019-05-01   | 2019-05-31 |
+-------------+---------------------+--------------+------------+

Solution

  • For this version of the problem, I would use lag() to determine where the groups start, then a cumulative sum and aggregation:

    select dimensionid, DimensionAttribute,
           min(row_start_date), max(row_end_date)
    from (select t.*,
                 sum(case when prev_red = dateadd(day, -1, row_start_date)
                          then 0 else 1
                     end) over (partition by dimensionid, DimensionAttribute order by row_start_date) as grp
          from (select t.*, 
                       lag(row_end_date) over (partition by dimensionid, DimensionAttribute order by row_start_date) as prev_red
                from t 
               ) t
         ) t
    group by dimensionid, DimensionAttribute, grp;
    

    In particular, this will recognize gaps in the rows. It will only combine rows when they exactly fit together -- the previous end date is one day before the start date. This can be tweaked, of course, to allow a gap of 1 or 2 days or to allow overlaps.