Search code examples
sqlpostgresqlquery-optimizationwindow-functionsgaps-and-islands

How to calculate a user's daily streak?


My goal

I'm trying to track and display a user's daily streak posting to my app, but struggling to write a query that works reliably and returns an accurate count.

Some context

My app has a prompt and a post table. A user can submit one post for each prompt (the prompts are created daily, so one post per day, per user).

The prompt table, simplified, looks like this:

id dateKey text
1 20240101 This is an example prompt.
2 20240102 Here is a second prompt.

The post table, simplified, is something like:

id content promptId authorId
50 This is my response to the prompt. 1 90
51 A second response to the same prompt. 1 91

Current query

I've tried a few different approaches to queries (using PARTITION BY, dense_rank() etc.) but could only get the user's longest streak. The query I had if you're interested:

select distinct on (p."authorId") count(distinct "dateKey"::date) as "streak"
from (select p.*,
      dense_rank() over (partition by p."authorId" order by "dateKey"::date) as seq
      from post p
      join prompt pt on p."promptId" = pt.id
     ) p
join prompt pt on p."promptId" = pt.id
where p."authorId" = 90
group by p."authorId", "dateKey"::date - seq * interval '1 day'
order by p."authorId", streak desc

This appears to work with the below data, but if you add a new "missed" prompt (which should reset the streak), this query would still return 2 (I think I understand why, but not sure how to correct it).

What I need

I basically need something to start at the latest prompt, and go down the list until it finds a prompt with no post from that user.

For example, this joined data would have a streak of 2:

id dateKey text postContent authorId
1 20240104 This is an example prompt. This is my response. 90
2 20240103 Here is a second prompt. A second response. 90
3 20240102 A third prompt. null null
4 20240101 My fourth prompt. A third response, but I missed a day. 90

Functionality most important, but would be nice if it were performant too (prompt might have 1000 rows, post might have millions, and the streak might reach 1000).

I'm a bit lost with PostgreSQL's capabilities for things like this, so hopefully there's a simple solution out there!

Fiddle: https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/11431


Solution

  • For a single table with natural value progression there are simpler solutions. But for the combination of two tables with a (seemingly) arbitrary next promptId I expect a recursive CTE to perform best:

    WITH RECURSIVE cte AS (
       SELECT CURRENT_DATE AS the_day, p."authorId" AS author_id
       FROM   prompt pt
       JOIN   post  p ON p."promptId" = pt.id
       WHERE  pt."dateKey" = CURRENT_DATE
       AND    p."authorId" = 90  -- your author here!
       
       UNION ALL
       SELECT c.the_day - 1, p."authorId"   -- assuming no gaps in prompt!
       FROM   cte   c
       JOIN   prompt pt ON pt."dateKey" = c.the_day - 1
       JOIN   post  p  ON p."promptId" = pt.id
       WHERE  p."authorId" = c.author_id
       )
    SELECT count(*)
    FROM   cte;
    

    fiddle

    Absolutely requires index support to be fast.
    Ideally, one index on prompt("dateKey", id), and one on post("authorId", "promptId").

    Assuming ...

    • ... we query for one given user,
    • ... we start "today",
    • ... no gaps in prompt - exactly one entry per day,
    • ... prompt."dateKey" is type date (as it should be).

    Related:

    DB design

    If there is at most one prompt per day, consider using the date (data type date!) as PK in table prompt and FK in table post. Allows for much simpler queries. See: