Search code examples
sqlpostgresqlconditional-statementscase

How do I update null cells based on the non-null cells from the same column using case statements?


I'm new to SQL, and I tried searching online for a similar post to my problem but came up empty. So here goes.

I have a table t1 that collates survey responses with the following columns: date, id, group, and level. There are more columns but these are basically the ones I'm concerned with. The first 3 columns tell me who answered the survey and when (there's a many to many relationship between the id and group column values). There is at most 1 entry for each id-group combination in each month (with the date value pegged on the first day of the month). The level column is their response to the survey.

For each existing pair of id and group, I want to get the level for each month from 2022 and 2023. The problem is, I have lots of gaps in that column. Take this small example:

date | id | group | level
2022-01-01 | 1 | A | null
2022-02-01 | 1 | A | 3
2022-03-01 | 1 | A | null
2022-04-01 | 1 | A | null
2022-05-01 | 1 | A | 4
2022-06-01 | 1 | A | 5
2022-07-01 | 1 | A | null

So, I want to extrapolate the data that I do have. For each row with a null level, I want to assign the level of the most recent non-null month prior to that "missing" month. If the row with the null value has no prior month with a non-null value (i.e. it's January 2022), get the value of the earliest month with a non-null level that occurs after the null row. So the desired output would be:

date | id | group | level
2022-01-01 | 1 | A | 3
2022-02-01 | 1 | A | 3
2022-03-01 | 1 | A | 3
2022-04-01 | 1 | A | 3
2022-05-01 | 1 | A | 4
2022-06-01 | 1 | A | 5
2022-07-01 | 1 | A | 5

My approach was like this: I used a table t2 to grab all the rows from t1 with non-null level values. I then created CTE t3 to basically be a duplicate of t1, with one extra column called insert. This new column will be where I put the extrapolated t1.level values and I'll use a simple update function to transfer t3.insert to the corresponding null t1.level rows.

with t3 as 
    Select distinct t1.date, t1.id, t1.group, t1.level,
    Case
--for when the null t1 record is earlier than the earliest t2 record
        When (t2.date = (select min(t2.date) where (t1.id, t1.group) = (t2.id, t2.group)))
        Then t2.level
--when the null t1 record is later than some existing t2 record
        When (t2.date = (select max(t2.date) where (t1.id, t1.group) = (t2.id, t2.group) AND t2.date <= t1.date))
        Then t2.level
        Else null
    End as "insert"
    from t1
    Join t2 on (t1.id, t1.group) = (t2.id, t2.group)
    GROUP BY t1.date, t1.id, t1.group, t1.level, t2.id, t2.group, t2.date, t2.level
    ORDER BY t1.id, t1.group, t1.date
)

I couldn't even get to the part where I update t1.level because I ran into a problem. I found that for every pair of id and group, t3 was assigning every t2.level value to each t3.insert. For the January 2022 record for id = 1 and group = A:

date | id | group | level | insert
2022-01-01 | 1 | A | null | 3
2022-01-01 | 1 | A | null | 4
2022-01-01 | 1 | A | null | 5

It seems like postgresql is ignoring the min() function in my case statement. I probably made a really rookie mistake.


Solution

  • The following query demonstrates an approach for determining the substitution values for NULL survey levels by finding the prior non-NULL level, or if none exists, then the next non-NULL level.

    WITH t1(survey_date, id, survey_group, survey_level) AS (
      VALUES ('2022-01-01'::date, 1, 'A', null),
             ('2022-02-01'::date, 1, 'A', 3),
             ('2022-03-01'::date, 1, 'A', null),
             ('2022-04-01'::date, 1, 'A', null),
             ('2022-05-01'::date, 1, 'A', 4),
             ('2022-06-01'::date, 1, 'A', 5),
             ('2022-07-01'::date, 1, 'A', null)
    )
    SELECT t1.survey_date, t1.id, t1.survey_group, t1.survey_level,
           COALESCE(t1.survey_level,
                    -- prior non-NULL level
                    (array_agg(t1.survey_level)
                       FILTER (WHERE t1.survey_level IS NOT NULL)
                       OVER (PARTITION BY t1.id, t1.survey_group
                             ORDER BY t1.survey_date DESC
                             ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING))[1],
                    -- next non-NULL level
                    (array_agg(t1.survey_level)
                       FILTER (WHERE t1.survey_level IS NOT NULL)
                       OVER (PARTITION BY t1.id, t1.survey_group
                             ORDER BY t1.survey_date
                             ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING))[1]) AS substitute_level
      FROM t1
      ORDER BY t1.survey_date, t1.id, t1.survey_group;
    

    The prior and next values are collected into arrays ordered such that the first element of the array is the nearest non-NULL value in the given direction. Because the prior values are sorted in descending order, the window frame is defined using FOLLOWING instead of PRECEDING.