Tags: sql, sql-server, t-sql, gaps-and-islands, scd2

Merge lines over timespan in SCD2 table


I have the following table, sourced from an SCD2 table. From this source table I have selected only a few columns, which results in several rows that look identical apart from their dates. I want to remove the unnecessary rows, those that contain the same data, and have the ValidFrom column show the first value and the ValidTo column show the last value within "the timespan group".

Source data:

| Item     | Color      | ValidFrom     | ValidTo    |
| -------- | ---------- | ------------- | ---------- |
| Ball     | Red        | 2020-01-01    | 2020-03-24 |
| Ball     | Blue       | 2020-03-25    | 2020-04-12 |
| Ball     | Blue       | 2020-04-13    | 2020-05-07 |
| Ball     | Blue       | 2020-05-08    | 2020-11-14 |
| Ball     | Red        | 2020-11-15    | 9999-12-31 |
| Doll     | Yellow     | 2020-01-01    | 2020-03-24 |
| Doll     | Green      | 2020-03-25    | 2020-04-12 |
| Doll     | Green      | 2020-04-13    | 2020-05-07 |
| Doll     | Green      | 2020-05-08    | 2020-11-14 |
| Doll     | Pink       | 2020-11-15    | 9999-12-31 | 

What I want to accomplish is this:

| Item     | Color      | ValidFrom     | ValidTo    |
| -------- | ---------- | ------------- | ---------- |
| Ball     | Red        | 2020-01-01    | 2020-03-24 |
| Ball     | Blue       | 2020-03-25    | 2020-11-14 |
| Ball     | Red        | 2020-11-15    | 9999-12-31 |
| Doll     | Yellow     | 2020-01-01    | 2020-03-24 |
| Doll     | Green      | 2020-03-25    | 2020-11-14 |
| Doll     | Pink       | 2020-11-15    | 9999-12-31 | 

Note that the Item Ball at first has the color Red, then Blue and then goes back to Red. That makes things a bit more complicated, from what I have learned.

Thanks for your help.


Solution

  • Your data is very regular. You seem to just want to combine adjacent, tiled records that have no overlaps or gaps. However, the following also handles gaps and more general overlaps:

    select item, color,
           min(validfrom) as validfrom,
           max(validto) as validto
    from (select t.*,
                 -- running count of island starts = island number within each item
                 sum(case when prev_validto >= dateadd(day, -1, validfrom)
                          then 0 else 1
                     end) over (partition by item order by validfrom) as grp
          from (select t.*,
                       -- end date of the previous row with the same item and color
                       lag(validto) over (partition by item, color order by validfrom) as prev_validto
                from t
                ) t
         ) t
    group by item, color, grp;
    

    You are looking for islands of rows in the original data, where each island has the same item and color and adjacent dates. The query determines where islands start by looking at the previous row for the same item and color. If there is no such row, or that row ends more than one day before the current row begins, then the current row is the beginning of an island.

    The grp is then the cumulative sum of "island beginnings" within each item, so all rows of one island share the same grp value. That value can be used for aggregating and getting the final results.
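
To see what the flag and the running sum do step by step, the same logic can be traced outside the database. The following is only a rough sketch in plain Python, not part of the query above: `merge_islands` and its variable names are hypothetical, and the input is just the Ball rows from the question.

```python
from datetime import date, timedelta

# Ball rows from the question: (item, color, validfrom, validto).
rows = [
    ("Ball", "Red",  date(2020, 1, 1),   date(2020, 3, 24)),
    ("Ball", "Blue", date(2020, 3, 25),  date(2020, 4, 12)),
    ("Ball", "Blue", date(2020, 4, 13),  date(2020, 5, 7)),
    ("Ball", "Blue", date(2020, 5, 8),   date(2020, 11, 14)),
    ("Ball", "Red",  date(2020, 11, 15), date(9999, 12, 31)),
]


def merge_islands(rows):
    """Hypothetical helper mimicking lag() + running sum + group by."""
    last_validto = {}  # (item, color) -> validto of the previous such row (the lag)
    grp = {}           # item -> running count of island starts (the windowed sum)
    islands = {}       # (item, color, grp) -> [min validfrom, max validto]
    # Process rows in the order the window functions use: item, then validfrom.
    for item, color, validfrom, validto in sorted(rows, key=lambda r: (r[0], r[2])):
        prev = last_validto.get((item, color))
        # A row starts an island when there is no previous row for this item and
        # color, or that row ended more than one day before this one begins.
        is_start = prev is None or prev < validfrom - timedelta(days=1)
        grp[item] = grp.get(item, 0) + (1 if is_start else 0)
        key = (item, color, grp[item])
        if key in islands:
            islands[key][0] = min(islands[key][0], validfrom)
            islands[key][1] = max(islands[key][1], validto)
        else:
            islands[key] = [validfrom, validto]
        last_validto[(item, color)] = validto
    return [(i, c, lo, hi) for (i, c, _), (lo, hi) in islands.items()]


for island in merge_islands(rows):
    print(island)
```

For these rows the island numbers come out as 1, 2, 2, 2, 3: the second Red row is compared with the first Red row, which ended months earlier, so it starts a new island instead of merging with it.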

    Your specific data is quite constrained: it is perfectly tiled, with each row ending the day before the next begins. In that case you can do something very similar using a left join:

    select item, color,
           min(validfrom) as validfrom,
           max(validto) as validto
    from (select t.*,
                 -- no matching previous row means this row starts a new island
                 sum(case when tprev.color is null then 1 else 0
                     end) over (partition by t.item order by t.validfrom) as grp
          from t left join
               t tprev
               on tprev.item = t.item and
                  tprev.color = t.color and
                  tprev.validto = dateadd(day, -1, t.validfrom)
         ) t
    group by item, color, grp
    order by item, min(validfrom);
    

    Here is a db<>fiddle illustrating both methods