sql postgresql window-functions gaps-and-islands

PostgreSQL fill in the blanks in an outer join


Outer Join 'fill-in-the-blanks'

I have a pair of master-detail tables in a PostgreSQL database. The master table 'samples' holds samples, each with a timestamp. The detail table 'sample_values' holds values for some parameters at some of those sample timestamps.
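
For reference, a minimal sketch of what the two tables might look like. The column types, keys and constraints are assumptions (the question only shows column names), and the rows are reconstructed from the query output below.

```sql
-- Hypothetical schema: types and constraints are assumed, not taken from the question.
CREATE TABLE samples (
    sample_id   integer PRIMARY KEY,
    sample_time timestamptz NOT NULL
);

CREATE TABLE sample_values (
    sample_id    integer NOT NULL REFERENCES samples (sample_id),
    parameter_id integer NOT NULL,
    sample_value numeric NOT NULL,
    PRIMARY KEY (sample_id, parameter_id)
);

-- 23 samples, one per minute, starting at 2023-01-13 01:00 UTC
INSERT INTO samples (sample_id, sample_time)
SELECT n, timestamptz '2023-01-13 01:00:00+00' + (n - 1) * interval '1 minute'
FROM generate_series(1, 23) AS n;

-- Only some samples carry a value for some parameters
INSERT INTO sample_values (sample_id, parameter_id, sample_value) VALUES
    (1, 1, 1.23), (1, 2, 4.98),
    (5, 2, 6.08),
    (7, 1, 1.89),
    (15, 1, 2.11),
    (19, 2, 3.57),
    (23, 1, 3.21), (23, 2, 5.31);
```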

My Query

SELECT s.sample_id, s.sample_time, v.parameter_id, v.sample_value
FROM samples s LEFT OUTER JOIN sample_values v ON v.sample_id=s.sample_id
ORDER BY s.sample_id, v.parameter_id;

returns (as expected):

sample_id sample_time parameter_id sample_value
1 2023-01-13T01:00:00.000Z 1 1.23
1 2023-01-13T01:00:00.000Z 2 4.98
2 2023-01-13T01:01:00.000Z
3 2023-01-13T01:02:00.000Z
4 2023-01-13T01:03:00.000Z
5 2023-01-13T01:04:00.000Z 2 6.08
6 2023-01-13T01:05:00.000Z
7 2023-01-13T01:06:00.000Z 1 1.89
8 2023-01-13T01:07:00.000Z
9 2023-01-13T01:08:00.000Z
10 2023-01-13T01:09:00.000Z
11 2023-01-13T01:10:00.000Z
12 2023-01-13T01:11:00.000Z
13 2023-01-13T01:12:00.000Z
14 2023-01-13T01:13:00.000Z
15 2023-01-13T01:14:00.000Z 1 2.11
16 2023-01-13T01:15:00.000Z
17 2023-01-13T01:16:00.000Z
18 2023-01-13T01:17:00.000Z
19 2023-01-13T01:18:00.000Z 2 3.57
20 2023-01-13T01:19:00.000Z
21 2023-01-13T01:20:00.000Z
22 2023-01-13T01:21:00.000Z
23 2023-01-13T01:22:00.000Z 1 3.21
23 2023-01-13T01:22:00.000Z 2 5.31

How do I write a query that returns one row per timestamp per parameter, where sample_value is the 'latest known' sample_value for that parameter, like this (first rows shown):

sample_id sample_time parameter_id sample_value
1 2023-01-13T01:00:00.000Z 1 1.23
1 2023-01-13T01:00:00.000Z 2 4.98
2 2023-01-13T01:01:00.000Z 1 1.23
2 2023-01-13T01:01:00.000Z 2 4.98
3 2023-01-13T01:02:00.000Z 1 1.23
3 2023-01-13T01:02:00.000Z 2 4.98
4 2023-01-13T01:03:00.000Z 1 1.23
4 2023-01-13T01:03:00.000Z 2 4.98
5 2023-01-13T01:04:00.000Z 1 1.23
5 2023-01-13T01:04:00.000Z 2 6.08
6 2023-01-13T01:05:00.000Z 1 1.23
6 2023-01-13T01:05:00.000Z 2 6.08
7 2023-01-13T01:06:00.000Z 1 1.89
7 2023-01-13T01:06:00.000Z 2 6.08
8 2023-01-13T01:07:00.000Z 1 1.89
8 2023-01-13T01:07:00.000Z 2 6.08


I cannot get my head around the LAST_VALUE function (if that is even the right tool for this?):

LAST_VALUE ( expression )  
OVER ( 
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

Solution

  • First of all, you need one row per parameter for each of your sample ids (two rows each, with your data). You can achieve this by cross joining samples with the distinct parameter_id values from sample_values, and adding the parameter condition to the LEFT JOIN as well.

    ...
    FROM samples s
    CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
    LEFT JOIN sample_values v 
           ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
    ...
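
    Filled in as a complete statement, the fragment above could look like the sketch below. It only builds the sample/parameter grid and does not carry any values forward yet, so rows without a measurement still show NULL.

    ```sql
    -- One row per (sample, parameter); sample_value is NULL where nothing was measured
    SELECT s.sample_id, s.sample_time, p.parameter_id, v.sample_value
    FROM samples s
    CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
    LEFT JOIN sample_values v
           ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
    ORDER BY s.sample_id, p.parameter_id;
    -- With the data from the question, the first rows are:
    --   1 | 2023-01-13 01:00 | 1 | 1.23
    --   1 | 2023-01-13 01:00 | 2 | 4.98
    --   2 | 2023-01-13 01:01 | 1 | NULL
    --   2 | 2023-01-13 01:01 | 2 | NULL
    ```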
    

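    In addition to this, your intuition about the LAST_VALUE window function was reasonable. The problem is that PostgreSQL, at the time of writing, does not support the IGNORE NULLS option for window functions, so LAST_VALUE cannot skip the null rows. A common workaround is to build groups per parameter_id from a running count of the non-null sample_value rows: COUNT(v.sample_value) only increases on a row that has a value, so each group contains one non-null value followed by the null rows that come after it. Taking the maximum within each group then returns that single non-null value. Rows that precede the first known value of a parameter fall into group 0 and stay null.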

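    WITH cte AS (
        SELECT s.sample_id, s.sample_time, p.parameter_id, v.sample_value,
               -- Running count of non-null values per parameter: it only
               -- increases on rows that carry a value, so it labels each
               -- value together with the null rows that follow it.
               COUNT(v.sample_value) OVER(
                   PARTITION BY p.parameter_id 
                   ORDER     BY s.sample_id
               ) AS partitions
        FROM samples s
        CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
        LEFT JOIN sample_values v 
               ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
    )
    SELECT sample_id, sample_time, parameter_id, 
           -- Each (parameter_id, partitions) group holds at most one non-null
           -- value, so MAX simply picks it up for the null rows.
           COALESCE(sample_value, 
                    MAX(sample_value) OVER (PARTITION BY parameter_id, partitions)
           ) AS sample_value
    FROM cte
    ORDER BY sample_id, parameter_id;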
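
    To see what the partitions column does, it can help to look at the running count for a single parameter. The sketch below uses the same counting expression, restricted to parameter_id = 1 purely to keep the output short. The commented rows assume the data shown in the question.

    ```sql
    -- Running count of non-null values for parameter 1 only
    SELECT s.sample_id, v.sample_value,
           COUNT(v.sample_value) OVER (ORDER BY s.sample_id) AS partitions
    FROM samples s
    LEFT JOIN sample_values v
           ON v.sample_id = s.sample_id AND v.parameter_id = 1
    ORDER BY s.sample_id;
    -- sample_id | sample_value | partitions
    --         1 |         1.23 |          1
    --         2 |         NULL |          1
    --       ... |          ... |        ...
    --         6 |         NULL |          1
    --         7 |         1.89 |          2
    --         8 |         NULL |          2
    ```

    Every row with partitions = 1 belongs to the group whose only non-null value is 1.23, which is why MAX over (parameter_id, partitions) fills the gaps.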
    

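
    If the running-count trick feels opaque, another option is a LATERAL subquery that looks up, for each sample and parameter, the most recent value at or before that sample. This is a different technique from the window-function approach above, sketched under the same assumption that sample_id increases with sample_time.

    ```sql
    SELECT s.sample_id, s.sample_time, p.parameter_id, lv.sample_value
    FROM samples s
    CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
    LEFT JOIN LATERAL (
        -- Latest known value for this parameter at or before the current sample
        SELECT v.sample_value
        FROM sample_values v
        WHERE v.parameter_id = p.parameter_id
          AND v.sample_id <= s.sample_id
        ORDER BY v.sample_id DESC
        LIMIT 1
    ) lv ON true
    ORDER BY s.sample_id, p.parameter_id;
    ```

    The LEFT JOIN ... ON true keeps rows for which no earlier value exists, so they come back as NULL just as in the window-function version. On larger tables this form would likely benefit from an index on sample_values (parameter_id, sample_id).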