Search code examples
sqlpostgresqlgreatest-n-per-groupwindow-functionsgaps-and-islands

Count the most recent number of consecutive rows with a null value


I'm doing some data analysis in PostgreSQL 10.6, based on this demo setup:

create table history (
    registered_unix int,
    requested_unix int,
    alias character varying(255)
);

INSERT INTO history VALUES
  (1537841388, 1537878224, '3')
, (1538093202, 1538095740, '1')
, (1538093186, 1538095740, '3')
, (1538105501, 1538107039, '2')
, (1538105501, 1538107039, '4')
, (1538205007, 1538242243, '2')
, (1538205012, NULL      , '1')
, (1538105501, NULL      , '1')
, (1538205007, NULL      , '3')
, (1538105501, NULL      , '3')
, (1538040863, NULL      , '3')
, (1537985996, NULL      , '3')
, (1538205007, NULL      , '4')
, (1538093186, NULL      , '4')
, (1538205301, NULL      , '5')
, (1538105501, NULL      , '5')
, (1538093186, NULL      , '5');

I'm trying to count the number of consecutive alias rows where requested_unix is null, but only for the first occurrence and where the consecutive occurrences are greater or equal to 2.

The data is sorted by alias, then registered_unix DESC:

enter image description here

Desired result:

enter image description here

Alias 1: Included b/c its 2 most recent "requested_unix" values are null.

Alias 2: Skipped. Requested unix has a value on its most recent row

Alias 3: Included b/c its 2 most recent "requested_unix" values are null. The older ones are discarded.

Alias 4: Skipped. Only has 1 most recent "requested_unix" value with null. 2nd row has a value.

Alias 5: Included. 3 consecutive most recent values are null.

There are all kind of tickets with counting consecutive occurrences or finding the first occurrence, but it seems I want to combine the two approaches.


Solution

  • Using DISTINCT ON, we only need a single subquery:

    SELECT alias
         , CASE WHEN requested_unix IS NULL THEN ct   ELSE rn - 1          END AS missed
         , CASE WHEN requested_unix IS NULL THEN NULL ELSE registered_unix END AS last_success
         , most_recent
    FROM  (
       SELECT DISTINCT ON (alias)
              *
            , row_number()         OVER (PARTITION BY alias ORDER BY registered_unix DESC) AS rn
            , count(*)             OVER (PARTITION BY alias) AS ct
            , max(registered_unix) OVER (PARTITION BY alias) AS most_recent
       FROM   history h
       ORDER  BY alias, requested_unix IS NULL, registered_unix DESC
       ) latest_success
    WHERE (requested_unix IS NULL OR rn > 2);
    

    db<>fiddle here

    In the subquery, retrieve a single row per alias with DISTINCT ON (alias). ORDER BY makes it the one we need.

    • The 1st ORDER-BY item alias must agree with DISTINCT ON.

    • The 2nd ORDER-BY item requested_unix IS NULL sorts notnull values on top. (FALSE sorts before TRUE.) requested_unix IS NOT NULL DESC might be more intuitive, but I prefer short code.

    • The 3rd ORDER-BY item registered_unix DESC picks the latest row among those.

    Add row number (rn), row count (ct), and the most recent registered_unix (most_recent) per partition with window functions (processed before DISTINCT ON), and you have all information necessary in a single query level. The outer SELECT is just to assemble the requested result format.

    Related: