Here is table1 with some example data:
id | date_column | col1 | col2 |
---|---|---|---|
1 | 06/03/2021 | NULL | 1 |
1 | 07/03/2021 | NULL | 1 |
1 | 08/03/2021 | 1 | 1 |
1 | 09/03/2021 | 1 | 2 |
2 | 05/03/2021 | 1 | 1 |
2 | 09/03/2021 | 1 | 1 |
I want to transform it into the following format:
id | valid_from | valid_to | col1 | col2 |
---|---|---|---|---|
1 | 06/03/2021 | 08/03/2021 | NULL | 1 |
1 | 08/03/2021 | 09/03/2021 | 1 | 1 |
1 | 09/03/2021 | 01/01/2100 | 1 | 2 |
2 | 05/03/2021 | 01/01/2100 | 1 | 1 |
So a new row in the desired format is created every time there is a new value in col1 or col2.
The valid_from is the the earliest value in date_column for this unique values in col1 and col2, while the the valid_to is the earliest value in date_column when any of these values have changed.
I was able to to achieve this transformation with the following SQL (Presto specific):
WITH base AS (
SELECT
*
FROM (
VALUES
(1, date('2021-03-06'), NULL, 1),
(1, date('2021-03-07'), NULL, 1),
(1, date('2021-03-08'), 1, 1),
(1, date('2021-03-09'), 1, 2),
(2, date('2021-03-05'), 1, 1),
(2, date('2021-03-09'), 1, 1)
) AS t (id, date_column, col1, col2)
)
, base2 AS (
SELECT
id
, date_column
, col1
, col2
, array_join(array[cast(col1 AS VARCHAR),
cast(col2 AS VARCHAR)], '','null') AS col_dedup
FROM
base
)
, base3 AS (
SELECT
id
, date_column
, col1
, col2
, coalesce(
lag(col_dedup) OVER (PARTITION BY id ORDER BY date_column) = col_dedup,
false
) AS same_as_previous
from base2
)
SELECT
id
, date_column AS valid_from
, lead(date_column, 1, date('2100-01-01')) OVER (PARTITION BY id ORDER BY date_column) AS valid_to
, col1
, col2
FROM
base3
WHERE
same_as_previous = false
ORDER BY
id
, date_column ASC
The difficulty is when you have 100 columns, all of these 100 columns must appear in the array_join
.
Now the actual question - is there a better way of doing the above transformation?
This is a type of gaps-and-islands problem . . . but actually a simple version. You want the first row of each grouping. Then lead()
to get the end date:
select id, col1, col2, datecol as valid_from,
lead(datecol, 1, '2100-01-01') over (partition by id order by datecol) as valid_to
from (select t1.*,
lag(datecol) over (partition by id order by datecol) as prev_datecol,
lag(datecol) over (partition by id, col1, col2 order by datecol1) as prev_datecol_12
from table1 t1
) t1
where prev_datecol_12 is null or
(prev_datecol <> prev_datecol_12);
Note that this method does not require aggregation, which is typically faster.
More importantly, this handles groups where the values return to a previous set of values. Your method does not do that. I am guessing that this is what you really want for this type of problem.