Modelling data from a single timestamp to a record with valid_from/valid_to timestamps when there is a change

Here is table1 with some example data:

id	date_column	col1	col2
1	06/03/2021	NULL	1
1	07/03/2021	NULL	1
1	08/03/2021	1	1
1	09/03/2021	1	2
2	05/03/2021	1	1
2	09/03/2021	1	1

I want to transform it into the following format:

id	valid_from	valid_to	col1	col2
1	06/03/2021	08/03/2021	NULL	1
1	08/03/2021	09/03/2021	1	1
1	09/03/2021	01/01/2100	1	2
2	05/03/2021	01/01/2100	1	1

So a new row in the desired format is created every time there is a new value in col1 or col2.

The valid_from is the earliest value in date_column for a given combination of values in col1 and col2, while the valid_to is the earliest value in date_column at which any of these values has changed.

I was able to achieve this transformation with the following SQL (Presto-specific):

WITH base AS (
SELECT
*
FROM (
  VALUES
    (1, date('2021-03-06'), NULL, 1),
    (1, date('2021-03-07'), NULL, 1),
    (1, date('2021-03-08'), 1, 1),
    (1, date('2021-03-09'), 1, 2),
    (2, date('2021-03-05'), 1, 1),
    (2, date('2021-03-09'), 1, 1)
) AS t (id, date_column, col1, col2)
)

, base2 AS (
SELECT
  id
, date_column
, col1
, col2
-- Use a non-empty delimiter: with '' the pairs (1, 11) and (11, 1) would both become '111'.
, array_join(array[cast(col1 AS VARCHAR),
                   cast(col2 AS VARCHAR)], '|', 'null') AS col_dedup
FROM
  base
)

, base3 AS (
SELECT
  id
, date_column
, col1
, col2

, coalesce(
    lag(col_dedup) OVER (PARTITION BY id  ORDER BY date_column) = col_dedup, 
    false
) AS same_as_previous

from base2
)

SELECT
  id
, date_column                                                                          AS valid_from
, lead(date_column, 1, date('2100-01-01')) OVER (PARTITION BY id ORDER BY date_column) AS valid_to
, col1
, col2
FROM
  base3
WHERE
  same_as_previous = false
ORDER BY
  id
, date_column ASC

The difficulty is that with 100 columns, all 100 of them must appear in the array_join.

Now the actual question - is there a better way of doing the above transformation?

Solution

This is a type of gaps-and-islands problem . . . but actually a simple version. You want the first row of each grouping. Then lead() to get the end date:

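select id, col1, col2, date_column as valid_from,
       -- the next start row's date closes this range; the last range stays open until 2100-01-01
       lead(date_column, 1, date '2100-01-01') over (partition by id order by date_column) as valid_to
from (select t1.*,
             -- date of the previous row for this id
             lag(date_column) over (partition by id order by date_column) as prev_date,
             -- date of the previous row for this id that has the same col1/col2 values
             lag(date_column) over (partition by id, col1, col2 order by date_column) as prev_date_12
      from table1 t1
     ) t1
-- keep only the first row of each run of identical values
where prev_date_12 is null or
      (prev_date <> prev_date_12);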

Note that this method does not require aggregation, so it is typically faster.
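As a quick sanity check, the query can be run against the example data from the question. The sketch below uses Python's built-in sqlite3 module as a stand-in for Presto, which is purely an assumption for testing: it needs SQLite 3.25 or later for window functions, stores dates as ISO-8601 text, and names the date column date_column as in the question.

```python
import sqlite3

# In-memory SQLite stands in for Presto here (an assumption for testing only).
# Dates are stored as ISO-8601 text so that they sort chronologically.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER, date_column TEXT, col1 INTEGER, col2 INTEGER)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (1, "2021-03-06", None, 1),
        (1, "2021-03-07", None, 1),
        (1, "2021-03-08", 1, 1),
        (1, "2021-03-09", 1, 2),
        (2, "2021-03-05", 1, 1),
        (2, "2021-03-09", 1, 1),
    ],
)

QUERY = """
SELECT id,
       date_column AS valid_from,
       lead(date_column, 1, '2100-01-01')
           OVER (PARTITION BY id ORDER BY date_column) AS valid_to,
       col1,
       col2
FROM (SELECT t1.*,
             lag(date_column) OVER (PARTITION BY id ORDER BY date_column) AS prev_date,
             lag(date_column) OVER (PARTITION BY id, col1, col2 ORDER BY date_column) AS prev_date_12
      FROM table1 t1) t1
WHERE prev_date_12 IS NULL OR prev_date <> prev_date_12
ORDER BY id, valid_from
"""

# The WHERE clause keeps only the first row of each run, so lead() then sees
# only those start rows and returns the start of the next run as valid_to.
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The printed rows should match the desired format in the question, with the NULL in col1 treated as an ordinary group value by the partition by.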

More importantly, this handles groups where the values return to a previous set of values. Your method does not do that. I am guessing that this is what you really want for this type of problem.
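To see why comparing the two lag() values works when values return to an earlier set, here is a small sketch on made-up data in which (col1, col2) goes (1, 1), then (2, 2), then back to (1, 1). The rows are hypothetical and not from the question, and SQLite (3.25 or later, via Python's sqlite3) again stands in for Presto as a testing assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER, date_column TEXT, col1 INTEGER, col2 INTEGER)"
)
# Hypothetical rows: the values go (1, 1) -> (2, 2) -> back to (1, 1).
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (1, "2021-03-01", 1, 1),
        (1, "2021-03-02", 1, 1),
        (1, "2021-03-03", 2, 2),
        (1, "2021-03-04", 1, 1),
        (1, "2021-03-05", 1, 1),
    ],
)

# The inner subquery of the solution, shown on its own to expose both lags.
INNER = """
SELECT date_column, col1, col2,
       lag(date_column) OVER (PARTITION BY id ORDER BY date_column) AS prev_date,
       lag(date_column) OVER (PARTITION BY id, col1, col2 ORDER BY date_column) AS prev_date_12
FROM table1
ORDER BY id, date_column
"""
flagged = conn.execute(INNER).fetchall()
for row in flagged:
    print(row)

# Mirror of the WHERE clause: a row starts a new range when no earlier row has
# the same values, or when the previous row with the same values is not the
# immediately preceding row.
starts = [r[0] for r in flagged if r[4] is None or r[3] != r[4]]
print(starts)
```

For the row dated 2021-03-04, prev_date is 2021-03-03 while prev_date_12 is 2021-03-02, so the two differ and the row is kept as the start of a new range even though (1, 1) has been seen before.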