sql postgresql query-optimization window-functions gaps-and-islands

How to calculate a user's daily streak?

My goal

I'm trying to track and display a user's daily streak posting to my app, but struggling to write a query that works reliably and returns an accurate count.

Some context

My app has a prompt and a post table. A user can submit one post for each prompt (the prompts are created daily, so one post per day, per user).

The prompt table, simplified, looks like this:

id	dateKey	text
1	20240101	This is an example prompt.
2	20240102	Here is a second prompt.

The post table, simplified, is something like:

id	content	promptId	authorId
50	This is my response to the prompt.	1	90
51	A second response to the same prompt.	1	91

Current query

I've tried a few different approaches to queries (using PARTITION BY, dense_rank() etc.) but could only get the user's longest streak. The query I had if you're interested:

select distinct on (p."authorId") count(distinct "dateKey"::date) as "streak"
from (select p.*,
      dense_rank() over (partition by p."authorId" order by "dateKey"::date) as seq
      from post p
      join prompt pt on p."promptId" = pt.id
     ) p
join prompt pt on p."promptId" = pt.id
where p."authorId" = 90
group by p."authorId", "dateKey"::date - seq * interval '1 day'
order by p."authorId", streak desc

This appears to work with the below data, but if you add a new "missed" prompt (which should reset the streak), this query would still return 2 (I think I understand why, but not sure how to correct it).

What I need

I basically need something to start at the latest prompt, and go down the list until it finds a prompt with no post from that user.

For example, this joined data would have a streak of 2:

id	dateKey	text	postContent	authorId
1	20240104	This is an example prompt.	This is my response.	90
2	20240103	Here is a second prompt.	A second response.	90
3	20240102	A third prompt.	null	null
4	20240101	My fourth prompt.	A third response, but I missed a day.	90

Functionality most important, but would be nice if it were performant too (prompt might have 1000 rows, post might have millions, and the streak might reach 1000).

I'm a bit lost with PostgreSQL's capabilities for things like this, so hopefully there's a simple solution out there!

Fiddle: https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/11431

Solution

For a single table with natural value progression there are simpler solutions. But for the combination of two tables with a (seemingly) arbitrary next promptId I expect a recursive CTE to perform best:

WITH RECURSIVE cte AS (
   SELECT CURRENT_DATE AS the_day, p."authorId" AS author_id
   FROM   prompt pt
   JOIN   post  p ON p."promptId" = pt.id
   WHERE  pt."dateKey" = CURRENT_DATE
   AND    p."authorId" = 90  -- your author here!
   
   UNION ALL
   SELECT c.the_day - 1, p."authorId"   -- assuming no gaps in prompt!
   FROM   cte   c
   JOIN   prompt pt ON pt."dateKey" = c.the_day - 1
   JOIN   post  p  ON p."promptId" = pt.id
   WHERE  p."authorId" = c.author_id
   )
SELECT count(*)
FROM   cte;

fiddle

Absolutely requires index support to be fast.
Ideally, one index on prompt("dateKey", id), and one on post("authorId", "promptId").

Assuming ...

... we query for one given user,
... we start "today",
... no gaps in prompt - exactly one entry per day,
... prompt."dateKey" is type date (as it should be).

WITH RECURSIVE query to choose the longest paths

DB design

If there is at most one prompt per day, consider using the date (data type date!) as PK in table prompt and FK in table post. Allows for much simpler queries. See:

Select longest continuous sequence