I'm doing some data analysis in PostgreSQL 10.6, based on this demo setup:
create table history (
registered_unix int,
requested_unix int,
alias character varying(255)
);
INSERT INTO history VALUES
(1537841388, 1537878224, '3')
, (1538093202, 1538095740, '1')
, (1538093186, 1538095740, '3')
, (1538105501, 1538107039, '2')
, (1538105501, 1538107039, '4')
, (1538205007, 1538242243, '2')
, (1538205012, NULL , '1')
, (1538105501, NULL , '1')
, (1538205007, NULL , '3')
, (1538105501, NULL , '3')
, (1538040863, NULL , '3')
, (1537985996, NULL , '3')
, (1538205007, NULL , '4')
, (1538093186, NULL , '4')
, (1538205301, NULL , '5')
, (1538105501, NULL , '5')
, (1538093186, NULL , '5');
I'm trying to count the number of consecutive alias
rows where requested_unix
is null, but only for the first occurrence and where the consecutive occurrences are greater or equal to 2.
The data is sorted by alias, then registered_unix DESC:
Desired result:
Alias 1: Included b/c its 2 most recent "requested_unix" values are null.
Alias 2: Skipped. Requested unix has a value on its most recent row
Alias 3: Included b/c its 2 most recent "requested_unix" values are null. The older ones are discarded.
Alias 4: Skipped. Only has 1 most recent "requested_unix" value with null. 2nd row has a value.
Alias 5: Included. 3 consecutive most recent values are null.
There are all kind of tickets with counting consecutive occurrences or finding the first occurrence, but it seems I want to combine the two approaches.
Using DISTINCT ON
, we only need a single subquery:
SELECT alias
, CASE WHEN requested_unix IS NULL THEN ct ELSE rn - 1 END AS missed
, CASE WHEN requested_unix IS NULL THEN NULL ELSE registered_unix END AS last_success
, most_recent
FROM (
SELECT DISTINCT ON (alias)
*
, row_number() OVER (PARTITION BY alias ORDER BY registered_unix DESC) AS rn
, count(*) OVER (PARTITION BY alias) AS ct
, max(registered_unix) OVER (PARTITION BY alias) AS most_recent
FROM history h
ORDER BY alias, requested_unix IS NULL, registered_unix DESC
) latest_success
WHERE (requested_unix IS NULL OR rn > 2);
db<>fiddle here
In the subquery, retrieve a single row per alias
with DISTINCT ON (alias)
. ORDER BY
makes it the one we need.
The 1st ORDER-BY item alias
must agree with DISTINCT ON
.
The 2nd ORDER-BY item requested_unix IS NULL
sorts notnull values on top. (FALSE
sorts before TRUE
.) requested_unix IS NOT NULL DESC
might be more intuitive, but I prefer short code.
The 3rd ORDER-BY item registered_unix DESC
picks the latest row among those.
Add row number (rn
), row count (ct
), and the most recent registered_unix
(most_recent
) per partition with window functions (processed before DISTINCT ON
), and you have all information necessary in a single query level. The outer SELECT
is just to assemble the requested result format.
Related: