Search code examples
sqldatabaseprestotrino

Modelling data from a single timestamp to a record with valid_from/valid_to timestamps when there is a change


Here is table1 with some example data:

id date_column col1 col2
1 06/03/2021 NULL 1
1 07/03/2021 NULL 1
1 08/03/2021 1 1
1 09/03/2021 1 2
2 05/03/2021 1 1
2 09/03/2021 1 1

I want to transform it into the following format:

id valid_from valid_to col1 col2
1 06/03/2021 08/03/2021 NULL 1
1 08/03/2021 09/03/2021 1 1
1 09/03/2021 01/01/2100 1 2
2 05/03/2021 01/01/2100 1 1

So a new row in the desired format is created every time there is a new value in col1 or col2.

The valid_from is the the earliest value in date_column for this unique values in col1 and col2, while the the valid_to is the earliest value in date_column when any of these values have changed.

I was able to to achieve this transformation with the following SQL (Presto specific):

WITH base AS (
SELECT
*
FROM (
  VALUES
    (1, date('2021-03-06'), NULL, 1),
    (1, date('2021-03-07'), NULL, 1),
    (1, date('2021-03-08'), 1, 1),
    (1, date('2021-03-09'), 1, 2),
    (2, date('2021-03-05'), 1, 1),
    (2, date('2021-03-09'), 1, 1)
) AS t (id, date_column, col1, col2)
)

, base2 AS (
SELECT
  id
, date_column
, col1
, col2
, array_join(array[cast(col1 AS VARCHAR),
                   cast(col2 AS VARCHAR)], '','null') AS col_dedup
FROM
  base
)

, base3 AS (
SELECT
  id
, date_column
, col1
, col2

, coalesce(
    lag(col_dedup) OVER (PARTITION BY id  ORDER BY date_column) = col_dedup, 
    false
) AS same_as_previous

from base2
)

SELECT
  id
, date_column                                                                          AS valid_from
, lead(date_column, 1, date('2100-01-01')) OVER (PARTITION BY id ORDER BY date_column) AS valid_to
, col1
, col2
FROM
  base3
WHERE
  same_as_previous = false
ORDER BY
  id
, date_column ASC

The difficulty is when you have 100 columns, all of these 100 columns must appear in the array_join.

Now the actual question - is there a better way of doing the above transformation?


Solution

  • This is a type of gaps-and-islands problem . . . but actually a simple version. You want the first row of each grouping. Then lead() to get the end date:

    select id, col1, col2, datecol as valid_from,
           lead(datecol, 1, '2100-01-01') over (partition by id order by datecol) as valid_to
    from (select t1.*,
                 lag(datecol) over (partition by id order by datecol) as prev_datecol,
                 lag(datecol) over (partition by id, col1, col2 order by datecol1) as prev_datecol_12
          from table1 t1
         ) t1
    where prev_datecol_12 is null or 
          (prev_datecol <> prev_datecol_12);
    

    Note that this method does not require aggregation, which is typically faster.

    More importantly, this handles groups where the values return to a previous set of values. Your method does not do that. I am guessing that this is what you really want for this type of problem.