I want to replace NULL
values with a specified string. However, I want to make this replacement only for NULL
values coming after the first non-NULL
. Meaning, if a NULL
value is before the first non-NULL
, then keep it NULL
.
Consider this data for example:
# | user_id | some_date | animal |
# |---------|------------|---------|
# | 1 | 2022-01-01 | NULL | <~~ keep as NULL
# | 1 | 2022-01-02 | zebra | <~~ 'zebra' is the first non-NULL value for user_id = 1
# | 1 | 2022-01-03 | lion |
# | 1 | 2022-01-04 | NULL | <~~ replace NULL with 'no_animal'
# | 1 | 2022-01-05 | cat |
# | 2 | 2023-10-05 | NULL | <~~ keep as NULL
# | 2 | 2023-10-06 | NULL | <~~ keep as NULL
# | 2 | 2023-10-07 | dog | <~~ 'dog' is the first non-NULL value for user_id = 2
# | 2 | 2023-10-08 | frog |
# | 2 | 2023-10-09 | NULL | <~~ replace NULL with 'no_animal'
# | 3 | 2024-02-03 | hamster | <~~ 'hamster' is the first non-NULL value for user_id = 3
# | 3 | 2024-02-04 | rabbit |
# | 3 | 2024-02-05 | NULL | <~~ replace NULL with 'no_animal'
# | 3 | 2024-02-06 | NULL | <~~ replace NULL with 'no_animal'
The desired output should be:
# | user_id | some_date | animal | replaced_null |
# |---------|------------|---------|---------------|
# | 1 | 2022-01-01 | NULL | NULL |
# | 1 | 2022-01-02 | zebra | zebra |
# | 1 | 2022-01-03 | lion | lion |
# | 1 | 2022-01-04 | NULL | no_animal |
# | 1 | 2022-01-05 | cat | cat |
# | 2 | 2023-10-05 | NULL | NULL |
# | 2 | 2023-10-06 | NULL | NULL |
# | 2 | 2023-10-07 | dog | dog |
# | 2 | 2023-10-08 | frog | frog |
# | 2 | 2023-10-09 | NULL | no_animal |
# | 3 | 2024-02-03 | hamster | hamster |
# | 3 | 2024-02-04 | rabbit | rabbit |
# | 3 | 2024-02-05 | NULL | no_animal |
# | 3 | 2024-02-06 | NULL | no_animal |
I use AWS Athena that runs on Trino SQL.
WITH my_tbl AS (
SELECT *
FROM (VALUES
(1, DATE '2022-01-01', NULL),
(1, DATE '2022-01-02', 'zebra'),
(1, DATE '2022-01-03', 'lion'),
(1, DATE '2022-01-04', NULL),
(1, DATE '2022-01-05', 'cat'),
(2, DATE '2023-10-05', NULL),
(2, DATE '2023-10-06', NULL),
(2, DATE '2023-10-07', 'dog'),
(2, DATE '2023-10-08', 'frog'),
(2, DATE '2023-10-09', NULL),
(3, DATE '2024-02-03', 'hamster'),
(3, DATE '2024-02-04', 'rabbit'),
(3, DATE '2024-02-05', NULL),
(3, DATE '2024-02-06', NULL)
) AS t(user_id, some_date, animal)
)
You can use coalesce
with conditional lag
ignoring nulls:
SELECT *,
COALESCE(animal,
if(LAG(animal) ignore nulls over (PARTITION by user_id order by some_date) is not null, 'no_animal'))
as replaced_null
FROM my_tbl
ORDER by user_id, some_date;
Output:
user_id | some_date | animal | replaced_null |
---|---|---|---|
1 | 2022-01-01 | NULL | NULL |
1 | 2022-01-02 | zebra | zebra |
1 | 2022-01-03 | lion | lion |
1 | 2022-01-04 | NULL | no_animal |
1 | 2022-01-05 | cat | cat |
2 | 2023-10-05 | NULL | NULL |
2 | 2023-10-06 | NULL | NULL |
2 | 2023-10-07 | dog | dog |
2 | 2023-10-08 | frog | frog |
2 | 2023-10-09 | NULL | no_animal |
3 | 2024-02-03 | hamster | hamster |
3 | 2024-02-04 | rabbit | rabbit |
3 | 2024-02-05 | NULL | no_animal |
3 | 2024-02-06 | NULL | no_animal |