Search code examples
postgresqlwindow-functions

PostgreSQL: Forward fill NULL values with previous NOT NULL value in group


I'm trying fill NULL values in multiple columns (different column types INT, VARCHAR) with previous NOT NULL value in a group ordered by date. Considering following table:

CREATE TABLE IF NOT EXISTS test (
   id VARCHAR,
   date DATE,
   value_1 INT,
   value_2 VARCHAR
);


INSERT INTO test VALUES 
(1, '2022-01-04', 5, 'asdf'),
(1, '2022-01-03', NULL, NULL),
(1, '2022-01-02', NULL, 'def'),
(1, '2022-01-01', 4, NULL),
(2, '2022-01-04', 1, 'a'),
(2, '2022-01-03', NULL, NULL),
(2, '2022-01-02', 2, 'b'),
(2, '2022-01-01', NULL, NULL);

enter image description here

I want to get here (please consider value_1 --> INTEGER, value_2 string):

enter image description here


Solution

  • One day, PostgreSQL may support the IGNORE NULLS option for LEAD and LAG functions.

    In the mean time, you must use window functions to build groups, then select the maximum in each group.

    SELECT id, date,
           MAX(value_1) OVER (PARTITION BY id, grp_1) AS value_1,
           MAX(value_2) OVER (PARTITION BY id, grp_2) AS value_2
    FROM(
        SELECT *,
              COUNT(value_1) OVER (PARTITION BY id ORDER BY Date DESC) as grp_1,
              COUNT(value_2) OVER (PARTITION BY id ORDER BY Date DESC) as grp_2
        FROM test
    ) T
    ORDER BY ID, date
    

    Explanation:

    COUNT(value_1) OVER (PARTITION BY id ORDER BY Date ASC/DESC) is a cumulative count that only increases when value_1 is not null. The result is, for each consecutive value it takes (1, 2, 3), there will be 1 record where value_1 is not null, and 0, 1 or several records where the field is null.

    Example:

    1. First record encountered (following the ORDER BY Date specified in the window);
      value_1 is null => count is 0.
    2. Next record, value_1 is not null => count is 1.
    3. Next record, value_1 is null => count stays at 1.
    4. Next record, value_1 is null => count stays at 1.
    5. Next record, value_1 is not null => count is 2.
      ... and so on.

    Using this pattern, the MAX(value_1) OVER (PARTITION BY id, grp_1) works on this partition of record by taking the non-null value out of each group.

    1. Group [count = 0] has only 1 record with null => The max is null.
    2. Group [count = 1] has 1 non null and 2 null values => The max is the non null value.
    3. Group [count = 2] has 1 non null value (and an unspecified number of null values) => The max is the non null value.

    The pattern keeps going; for every value of count (except 0), the max window function always has exactly 1 non null value to pick.


    Edit:
    The above query is the answer to the original version of the question. Now that dates have been changed to be in the exact opposite order as before, the 3 occurrences of ORDER BY date must all be inverted to match. Basically, 2 negatives make a positive.

    SELECT id, date,
           MAX(value_1) OVER (PARTITION BY id, grp_1) AS value_1,
           MAX(value_2) OVER (PARTITION BY id, grp_2) AS value_2
    FROM(
        SELECT *,
              COUNT(value_1) OVER (PARTITION BY id ORDER BY Date ASC) as grp_1,
              COUNT(value_2) OVER (PARTITION BY id ORDER BY Date ASC) as grp_2
        FROM test
    ) T
    ORDER BY ID, date DESC