sql postgresql greatest-n-per-group window-functions gaps-and-islands

Count the most recent number of consecutive rows with a null value

I'm doing some data analysis in PostgreSQL 10.6, based on this demo setup:

create table history (
    registered_unix int,
    requested_unix int,
    alias character varying(255)
);

INSERT INTO history VALUES
  (1537841388, 1537878224, '3')
, (1538093202, 1538095740, '1')
, (1538093186, 1538095740, '3')
, (1538105501, 1538107039, '2')
, (1538105501, 1538107039, '4')
, (1538205007, 1538242243, '2')
, (1538205012, NULL      , '1')
, (1538105501, NULL      , '1')
, (1538205007, NULL      , '3')
, (1538105501, NULL      , '3')
, (1538040863, NULL      , '3')
, (1537985996, NULL      , '3')
, (1538205007, NULL      , '4')
, (1538093186, NULL      , '4')
, (1538205301, NULL      , '5')
, (1538105501, NULL      , '5')
, (1538093186, NULL      , '5');

I'm trying to count the number of consecutive alias rows where requested_unix is null, but only for the first occurrence and where the consecutive occurrences are greater or equal to 2.

The data is sorted by alias, then registered_unix DESC:

Desired result:

Alias 1: Included b/c its 2 most recent "requested_unix" values are null.

Alias 2: Skipped. Requested unix has a value on its most recent row

Alias 3: Included b/c its 2 most recent "requested_unix" values are null. The older ones are discarded.

Alias 4: Skipped. Only has 1 most recent "requested_unix" value with null. 2nd row has a value.

Alias 5: Included. 3 consecutive most recent values are null.

There are all kind of tickets with counting consecutive occurrences or finding the first occurrence, but it seems I want to combine the two approaches.

Solution

Using DISTINCT ON, we only need a single subquery:

SELECT alias
     , CASE WHEN requested_unix IS NULL THEN ct   ELSE rn - 1          END AS missed
     , CASE WHEN requested_unix IS NULL THEN NULL ELSE registered_unix END AS last_success
     , most_recent
FROM  (
   SELECT DISTINCT ON (alias)
          *
        , row_number()         OVER (PARTITION BY alias ORDER BY registered_unix DESC) AS rn
        , count(*)             OVER (PARTITION BY alias) AS ct
        , max(registered_unix) OVER (PARTITION BY alias) AS most_recent
   FROM   history h
   ORDER  BY alias, requested_unix IS NULL, registered_unix DESC
   ) latest_success
WHERE (requested_unix IS NULL OR rn > 2);

db<>fiddle here

In the subquery, retrieve a single row per alias with DISTINCT ON (alias). ORDER BY makes it the one we need.

The 1st ORDER-BY item alias must agree with DISTINCT ON.
The 2nd ORDER-BY item requested_unix IS NULL sorts notnull values on top. (FALSE sorts before TRUE.) requested_unix IS NOT NULL DESC might be more intuitive, but I prefer short code.
The 3rd ORDER-BY item registered_unix DESC picks the latest row among those.

Add row number (rn), row count (ct), and the most recent registered_unix (most_recent) per partition with window functions (processed before DISTINCT ON), and you have all information necessary in a single query level. The outer SELECT is just to assemble the requested result format.