I'm new to SQL, and I tried searching online for a similar post to my problem but came up empty. So here goes.
I have a table t1
that collates survey responses with the following columns: date
, id
, group
, and level
. There are more columns but these are basically the ones I'm concerned with. The first 3 columns tell me who answered the survey and when (there's a many to many relationship between the id
and group
column values). There is at most 1 entry for each id
-group
combination in each month (with the date value pegged on the first day of the month). The level
column is their response to the survey.
For each existing pair of id
and group
, I want to get the level
for each month from 2022 and 2023. The problem is, I have lots of gaps in that column. Take this small example:
date | id | group | level
2022-01-01 | 1 | A | null
2022-02-01 | 1 | A | 3
2022-03-01 | 1 | A | null
2022-04-01 | 1 | A | null
2022-05-01 | 1 | A | 4
2022-06-01 | 1 | A | 5
2022-07-01 | 1 | A | null
So, I want to extrapolate the data that I do have. For each row with a null level
, I want to assign the level
of the most recent non-null month prior to that "missing" month. If the row with the null value has no prior month with a non-null value (i.e. it's January 2022), get the value of the earliest month with a non-null level
that occurs after the null row. So the desired output would be:
date | id | group | level
2022-01-01 | 1 | A | 3
2022-02-01 | 1 | A | 3
2022-03-01 | 1 | A | 3
2022-04-01 | 1 | A | 3
2022-05-01 | 1 | A | 4
2022-06-01 | 1 | A | 5
2022-07-01 | 1 | A | 5
My approach was like this:
I used a table t2
to grab all the rows from t1
with non-null level
values. I then created CTE t3
to basically be a duplicate of t1
, with one extra column called insert
. This new column will be where I put the extrapolated t1.level
values and I'll use a simple update function to transfer t3.insert
to the corresponding null t1.level
rows.
with t3 as
Select distinct t1.date, t1.id, t1.group, t1.level,
Case
--for when the null t1 record is earlier than the earliest t2 record
When (t2.date = (select min(t2.date) where (t1.id, t1.group) = (t2.id, t2.group)))
Then t2.level
--when the null t1 record is later than some existing t2 record
When (t2.date = (select max(t2.date) where (t1.id, t1.group) = (t2.id, t2.group) AND t2.date <= t1.date))
Then t2.level
Else null
End as "insert"
from t1
Join t2 on (t1.id, t1.group) = (t2.id, t2.group)
GROUP BY t1.date, t1.id, t1.group, t1.level, t2.id, t2.group, t2.date, t2.level
ORDER BY t1.id, t1.group, t1.date
)
I couldn't even get to the part where I update t1.level
because I ran into a problem. I found that for every pair of id
and group
, t3
was assigning every t2.level
value to each t3.insert
. For the January 2022 record for id
= 1 and group
= A:
date | id | group | level | insert
2022-01-01 | 1 | A | null | 3
2022-01-01 | 1 | A | null | 4
2022-01-01 | 1 | A | null | 5
It seems like postgresql is ignoring the min() function in my case statement. I probably made a really rookie mistake.
The following query demonstrates an approach for determining the substitution values for NULL survey levels by finding the prior non-NULL level, or if none exists, then the next non-NULL level.
WITH t1(survey_date, id, survey_group, survey_level) AS (
VALUES ('2022-01-01'::date, 1, 'A', null),
('2022-02-01'::date, 1, 'A', 3),
('2022-03-01'::date, 1, 'A', null),
('2022-04-01'::date, 1, 'A', null),
('2022-05-01'::date, 1, 'A', 4),
('2022-06-01'::date, 1, 'A', 5),
('2022-07-01'::date, 1, 'A', null)
)
SELECT t1.survey_date, t1.id, t1.survey_group, t1.survey_level,
COALESCE(t1.survey_level,
-- prior non-NULL level
(array_agg(t1.survey_level)
FILTER (WHERE t1.survey_level IS NOT NULL)
OVER (PARTITION BY t1.id, t1.survey_group
ORDER BY t1.survey_date DESC
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING))[1],
-- next non-NULL level
(array_agg(t1.survey_level)
FILTER (WHERE t1.survey_level IS NOT NULL)
OVER (PARTITION BY t1.id, t1.survey_group
ORDER BY t1.survey_date
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING))[1]) AS substitute_level
FROM t1
ORDER BY t1.survey_date, t1.id, t1.survey_group;
The prior and next values are collected into arrays ordered such that the first element of the array is the nearest non-NULL value in the given direction. Because the prior values are sorted in descending order, the window frame is defined using FOLLOWING
instead of PRECEDING
.