sql postgresql window-functions gaps-and-islands

How to label groups in postgresql when group belonging depends on the preceding line?

I want, in a request, to fill all Null values by the last known value. When it's in a table and not in a request, it's easy:

If I define and fill my table as follows:

CREATE TABLE test_fill_null (
    date INTEGER,
    value INTEGER
);

INSERT INTO test_fill_null VALUES 
    (1,2),
    (2, NULL), 
    (3, 45), 
    (4,NULL), 
    (5, null);

SELECT * FROM test_fill_null ;
 date | value 
------+-------
    1 |     2
    2 |      
    3 |    45
    4 |      
    5 |

Then I just have to fill like that:

UPDATE test_fill_null t1
SET value = (
    SELECT t2.value 
    FROM test_fill_null t2 
    WHERE t2.date <= t1.date AND value IS NOT NULL 
    ORDER BY t2.date DESC 
    LIMIT 1
);

SELECT * FROM test_fill_null;
 date | value 
------+-------
    1 |     2
    2 |     2
    3 |    45
    4 |    45
    5 |    45

But now, I'm in a request, like this one:

WITH
    pre_table AS(
        SELECT 
            id1,
            id2,
            tms,
            CASE 
                WHEN tms - lag(tms) over w < interval '5 minutes' THEN NULL
                ELSE id2
            END as group_id
        FROM
            table0 
        window w as (partition by id1 order by tms)
    )

Where the group_id is set to id2 when the previous point is distant from more than 5 minutes, null otherwise. By doing so, I want to end up with group of points that follow each other by less than 5 minutes, and gaps of more than 5 minutes between each groups.

Then I don't know how to proceed. I tried:

    SELECT distinct on (id1, id2)
        t0.id1,
        t0.id2,
        t0.tms,
        t1.group_id
    FROM
        pre_table t0
        LEFT JOIN (
            select
                id1,
                tms,
                group_id
            from pre_table t2
            where t2.group_id is not null
            order by tms desc
        ) t1
        ON 
            t1.tms <= t0.tms AND
            t1.id1 = t0.id1  
    WHERE 
        t0.id1 IS NOT NULL
    ORDER BY
        id1,
        id2,
        t1.tms DESC

But in the final result I have some group with two consecutive points which are distant from more than 5 minutes. Their should be two different groups in this case.

Solution

A "select within a select" is more commonly called "subselect" or "subquery". In your particular case it's a correlated subquery. LATERAL joins can largely replace correlated subqueries with more flexible solutions:

What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?

I don't think you need either here.

For your first case this query is probably faster and simpler, though:

SELECT date, max(value) OVER (PARTITION BY grp) AS value
FROM  (
   SELECT *, count(value) OVER (ORDER BY date) AS grp
   FROM   test_fill_null
   ) sub;

count() only counts non-null values, so grp is incremented with every non-null value, thereby forming groups as desired. It's trivial to pick the one non-null value per grp in the outer SELECT.

For your second case, I'll assume the initial order of rows is determined by (id1, id2, tms) as indicated by one of your queries.

SELECT id1, id2, tms
     , count(step) OVER (ORDER BY id1, id2, tms) AS group_id
FROM  (
   SELECT *, lag(tms, 1, '-infinity') OVER (PARTITION BY id1 ORDER BY id2, tms) < tms - interval '5 min' AS step
   FROM   table0
   ) sub
ORDER  BY id1, id2, tms;

Adapt to your actual order. One of these might cover it:

PARTITION BY id1 ORDER BY id2  -- ignore tms
PARTITION BY id1 ORDER BY tms  -- ignore id2

fiddle - with an extended examples
_{Old sqlfiddle}

Select longest continuous sequence