sql postgresql aggregate-functions sql-delete gaps-and-islands

Basing OFFSET on another query?

My table structure is roughly as described in this post:

name	processed	processing	updated	ref_time
abc	t	f	27794395	27794160
def	t	f	27794395	27793440
ghi	t	f	27794395	27793440
jkl	f	f	27794395	27794160
mno	t	f	27794395	27793440
pqr	f	t	27794395	27794160

I created a dbfiddle already based on this table structure (more on this below), so there is no need to create your own.

Based on this answer, I am deriving a list of ref_time values to use as a basis for deleting 'old' entries from status_table:

with 
    ref as (
        select ref_time 
        from status_table 
        group by ref_time 
        having bool_and(processed)
        order by ref_time desc
        offset 1
    )
delete from status_table s
using ref r
where s.ref_time = r.ref_time

But now I want to be more sophisticated about what I use as the offset... I would ideally like to keep the most recent ref_time for which all records are processed (as per the above example where offset is 1), but the two most recent ref_time where the second ref_time has more associated records than the first (i.e. offset needs to be 2 to skip over the two most recent ref_time).

I figure that the following query (based on this answer) will help in this task, because it counts the total number of processed records based on ref_time:

select ref_time, 
    count(*) cnt_total,
    count(*) filter(where processed) cnt_processed,
    round(avg(processed::int),2) ratio_processed
from status_table
group by ref_time
order by ratio_processed desc, ref_time desc;

So in this dbfiddle I'd need to preserve ref_time=27794160 (rather than include it in the delete list as is the case in the example) because, although it is second, it also has a higher cnt_total than the first.

In general, the rule is that I want to keep all ref_time up to (but not including) the ref_time having the same cnt_total as the one before (or less).

Solution

WITH sel AS (
   SELECT ref_time
   FROM  (
      SELECT ref_time
           , count(*) FILTER (WHERE drop) OVER (ORDER BY ref_time DESC) AS drops
      FROM  (
         SELECT ref_time
              , lag(count(*)) OVER (ORDER BY ref_time DESC) >= count(*) IS TRUE AS drop
         FROM   status_table
         GROUP  BY ref_time
         HAVING bool_and(processed)
         ) sub1
      ) sub2
   WHERE drops > 0
   )
DELETE FROM status_table d
USING  sel s
WHERE  d.ref_time = s.ref_time;

fiddle

Subquery sub1 is mostly what you already had. Plus some more:

We only need count(*) since HAVING bool_and(processed) excludes other cases anyway.

lag(count(*)) OVER (ORDER BY ref_time DESC) >= count(*) checks whether the previous row has been same or greater, in which case, we'll call that a "drop". And we want to keep all rows before the first "drop".

So count(*) FILTER (WHERE drop) OVER (ORDER BY ref_time DESC) AS drops in sub2 counts those "drops", and WHERE drops > 0 eliminate the leading rows we want to preserve.

About this "gaps-and-islands" approach:

Select longest continuous sequence

If there can be concurrent writes on status_table, you may need to add the locking clause FOR UPDATE to be sure. But you cannot do that in an aggregate query, so you would add another subquery where you do that ... See:

Postgres UPDATE ... LIMIT 1