Search code examples
sqlwindow-functionsamazon-athenaprestotrino

Trino/Presto SQ: Replace NULL with a value only if the NULL comes after the first non-NULL value in the group


I want to replace NULL values with a specified string. However, I want to make this replacement only for NULL values coming after the first non-NULL. Meaning, if a NULL value is before the first non-NULL, then keep it NULL.

Consider this data for example:

# | user_id | some_date  | animal  |
# |---------|------------|---------|
# | 1       | 2022-01-01 | NULL    | <~~ keep as NULL
# | 1       | 2022-01-02 | zebra   | <~~ 'zebra' is the first non-NULL value for user_id = 1
# | 1       | 2022-01-03 | lion    |
# | 1       | 2022-01-04 | NULL    | <~~ replace NULL with 'no_animal'
# | 1       | 2022-01-05 | cat     |
# | 2       | 2023-10-05 | NULL    | <~~ keep as NULL
# | 2       | 2023-10-06 | NULL    | <~~ keep as NULL
# | 2       | 2023-10-07 | dog     | <~~ 'dog' is the first non-NULL value for user_id = 2
# | 2       | 2023-10-08 | frog    |
# | 2       | 2023-10-09 | NULL    | <~~ replace NULL with 'no_animal'
# | 3       | 2024-02-03 | hamster | <~~ 'hamster' is the first non-NULL value for user_id = 3
# | 3       | 2024-02-04 | rabbit  |
# | 3       | 2024-02-05 | NULL    | <~~ replace NULL with 'no_animal'
# | 3       | 2024-02-06 | NULL    | <~~ replace NULL with 'no_animal'

The desired output should be:

# | user_id | some_date  | animal  | replaced_null |
# |---------|------------|---------|---------------|
# | 1       | 2022-01-01 | NULL    | NULL          |
# | 1       | 2022-01-02 | zebra   | zebra         |
# | 1       | 2022-01-03 | lion    | lion          |
# | 1       | 2022-01-04 | NULL    | no_animal     |
# | 1       | 2022-01-05 | cat     | cat           |
# | 2       | 2023-10-05 | NULL    | NULL          |
# | 2       | 2023-10-06 | NULL    | NULL          |
# | 2       | 2023-10-07 | dog     | dog           |
# | 2       | 2023-10-08 | frog    | frog          |
# | 2       | 2023-10-09 | NULL    | no_animal     |
# | 3       | 2024-02-03 | hamster | hamster       |
# | 3       | 2024-02-04 | rabbit  | rabbit        |
# | 3       | 2024-02-05 | NULL    | no_animal     |
# | 3       | 2024-02-06 | NULL    | no_animal     |

SQL Dialect

I use AWS Athena that runs on Trino SQL.

Reproducible data

WITH my_tbl AS (
    SELECT *
    FROM (VALUES
        (1, DATE '2022-01-01', NULL),
        (1, DATE '2022-01-02', 'zebra'),
        (1, DATE '2022-01-03', 'lion'),
        (1, DATE '2022-01-04', NULL),
        (1, DATE '2022-01-05', 'cat'),
        (2, DATE '2023-10-05', NULL),
        (2, DATE '2023-10-06', NULL),
        (2, DATE '2023-10-07', 'dog'),
        (2, DATE '2023-10-08', 'frog'),
        (2, DATE '2023-10-09', NULL),
        (3, DATE '2024-02-03', 'hamster'),
        (3, DATE '2024-02-04', 'rabbit'),
        (3, DATE '2024-02-05', NULL),
        (3, DATE '2024-02-06', NULL)
    ) AS t(user_id, some_date, animal)
)

Solution

  • You can use coalesce with conditional lag ignoring nulls:

    SELECT *,
      COALESCE(animal, 
        if(LAG(animal) ignore nulls over (PARTITION by user_id order by some_date) is not null, 'no_animal'))
        as replaced_null
    FROM my_tbl
    ORDER by user_id, some_date;
    

    Output:

    user_id some_date animal replaced_null
    1 2022-01-01 NULL NULL
    1 2022-01-02 zebra zebra
    1 2022-01-03 lion lion
    1 2022-01-04 NULL no_animal
    1 2022-01-05 cat cat
    2 2023-10-05 NULL NULL
    2 2023-10-06 NULL NULL
    2 2023-10-07 dog dog
    2 2023-10-08 frog frog
    2 2023-10-09 NULL no_animal
    3 2024-02-03 hamster hamster
    3 2024-02-04 rabbit rabbit
    3 2024-02-05 NULL no_animal
    3 2024-02-06 NULL no_animal