Search code examples
sqlpostgresqlwindow-functionsgaps-and-islandstimestamp-with-timezone

Get most recent streak (consecutive days) for given event and time zone


This is a tough one I've been fighting with for a bit. I have a table called action_events that stores event_name and timestamp:

CREATE TABLE action_events (
  id SERIAL PRIMARY KEY,
  timestamp TIMESTAMP NOT NULL,
  event_name VARCHAR(255) NOT NULL
);

I want to grab the most recent "streak" for an event with a given name.

A "streak" is the number of consecutive days that the event has occurred at least once. An event may occur more than once a day. Big gotcha: The streak should also take into account a given timezone.

Using this fiddle: https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/7828

Given a query for "exercise" in "MST" timezone, I'd expect the streak to be:

Streak Count Name TimeZone Start Date End Date
13 "exercise" "MST" 2023-02-18 09:00:00 2023-02-30 09:00:00

Even though the streak ended a month ago, it should still show as the most recent streak.


Solution

  • Plain timestamp data is ignorant of time zones. A given time zone is meaningless while we don't know the time zone of the timestamp data.
    I'll assume your timestamps are supposed to represent UTC. Should really be timestamptz to avoid ambiguity and an additional transformation.

    For few rows per event

    Plain SQL. The query won't get much snappier than this:

    SELECT sum(ct)       AS streak_count
         , 'exercise'    AS event_name
         , 'MST'         AS timezone
         , min(min_ts)   AS start_date
         , max(max_ts)   AS end_date
    FROM  (
       SELECT *, the_day - row_number() OVER (ORDER BY the_day)::int AS streak
       FROM  (
          SELECT (timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'MST')::date AS the_day
               , count(*) AS ct
               , min(timestamp) AS min_ts
               , max(timestamp) AS max_ts           
          FROM   action_events
          WHERE  event_name = 'exercise'
          GROUP  BY 1
          ) sub1
       ) sub2
    GROUP  BY streak
    ORDER  BY end_date DESC
    LIMIT  1;
    

    fiddle

    Step-by-step

    sub1

    Filter only the event_name of interest right away - event_name = 'exercise' in the example.
    timestamp AT TIME ZONE 'UTC' produces timestamptz.
    ... AT TIME ZONE 'MST' then returns the corresponding timestamp at 'MST'. See:

    Cast to ::date (the_day), and group by that.
    So we get (at most) one row per day and carry along count, min, and max.

    sub2

    To identify streaks, simply subtract the row number (type integer!) from the_day (type date!). See:

    This produces the same (otherwise meaningless) day for consecutive days.
    (sub1 and sub2 could be merged, but that'd be unwieldy.)

    Outer SELECT

    Aggregate per streak, take sum of counts, min of min, and max of max. Sort by min (or max, same result) descending and take the first row (LIMIT 1).
    Voilá.

    In the result, start_date and end_date represent original timestamps (in UTC). You might want to show timestamps at 'MST' instead. You didn't say.

    No minimum streak length has been defined, so it could be just a single row.

    But since this processes all rows for the given event, it does not scale well for many rows.

    Many rows per event

    It will be (much) faster to start with the latest row and loop until a gap is encountered. While no minimum streak length is required, we can't go wrong.

    Demonstrating a PL/pgSQL function:

    CREATE OR REPLACE FUNCTION f_latest_streak(
       INOUT name        text
     , INOUT timezone    text
     , OUT   steak_count int
     , OUT   start_date  timestamp
     , OUT   end_date    timestamp)
      LANGUAGE plpgsql STRICT PARALLEL SAFE STABLE AS
    $func$
    DECLARE
       _day_start   timestamp;
       _start_date  timestamp;
       _streak_step int;
    BEGIN
       -- get end & UTC time for start of latest day at given time zone
       SELECT max(a.timestamp)
            , date_trunc('day', max(a.timestamp) AT TIME ZONE 'UTC' AT TIME ZONE timezone)
              AT TIME ZONE 'UTC' AT TIME ZONE 'UTC'  -- sic!
       INTO   end_date, _day_start
       FROM   action_events a
       WHERE  event_name = name;
       
       IF NOT FOUND THEN  -- no rows at all
          RETURN;
       END IF;
       
       -- get count for first day
       SELECT count(*)::int, min(timestamp)
       INTO   steak_count, start_date
       FROM   action_events a
       WHERE  a.event_name = name  -- careful with naming conflicts!
       AND    a.timestamp >= _day_start;
    
       -- more days?
       LOOP
          SELECT count(*)::int, min(timestamp)
          INTO   _streak_step, _start_date
          FROM   action_events a
          WHERE  a.event_name = name  -- careful with naming conflicts!
          AND    a.timestamp >= _day_start - interval '1 day'
          AND    a.timestamp <  _day_start;
          
          IF _streak_step = 0 THEN  -- streak ends here
             RETURN;
          ELSE
             steak_count := steak_count + _streak_step;         
             start_date  := _start_date;
             _day_start  := _day_start - interval '1 day';
          END IF;
       END LOOP;
    END
    $func$;
    

    Call:

    SELECT * FROM f_latest_streak('exercise', 'MST');
    

    A multicolumn index on (event_name, timestamp) will make the function fast.

    You should be comfortable with PL/pgSQL to play with this.