sql postgresql window-functions date-arithmetic gaps-and-islands

How to add a running count to rows in a 'streak' of consecutive days


Thanks to Mike for the suggestion to add the create/insert statements.

create table test (
  pid integer not null,
  date date not null,
  primary key (pid, date)
);

insert into test values
  (1,'2014-10-1')
, (1,'2014-10-2')
, (1,'2014-10-3')
, (1,'2014-10-5')
, (1,'2014-10-7')
, (2,'2014-10-1')
, (2,'2014-10-2')
, (2,'2014-10-3')
, (2,'2014-10-5')
, (2,'2014-10-7');

I want to add a new column that is 'days in current streak' so the result would look like:

pid    | date      | in_streak
-------|-----------|----------
1      | 2014-10-1 | 1
1      | 2014-10-2 | 2
1      | 2014-10-3 | 3
1      | 2014-10-5 | 1
1      | 2014-10-7 | 1
2      | 2014-10-1 | 1
2      | 2014-10-2 | 2
2      | 2014-10-3 | 3
2      | 2014-10-5 | 1
2      | 2014-10-7 | 1

I've been trying to adapt answers to similar questions, but I can't work out how to combine the dense_rank() trick with other window functions to get the right result.


Solution

  • Building on this table (avoiding the SQL keyword "date" as a column name):

    CREATE TABLE tbl(
      pid int
    , the_date date
    , PRIMARY KEY (pid, the_date)
    );
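
    To try the query below against the sample data, the table could be filled like this. This is only a sketch: the rows are copied from the INSERT in the question, and only the table and column names are adapted to tbl.

    INSERT INTO tbl (pid, the_date) VALUES
      (1, '2014-10-01')
    , (1, '2014-10-02')
    , (1, '2014-10-03')
    , (1, '2014-10-05')
    , (1, '2014-10-07')
    , (2, '2014-10-01')
    , (2, '2014-10-02')
    , (2, '2014-10-03')
    , (2, '2014-10-05')
    , (2, '2014-10-07');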
    

    Query:

    SELECT pid, the_date
         , row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
    FROM  (
       SELECT *
            , the_date - '2000-01-01'::date
            - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
       FROM   tbl
    ) sub
    ORDER  BY pid, the_date;
    

    Subtracting a date from another date yields an integer. Since you are looking for consecutive days, each next row within a streak is greater by exactly one. If we subtract row_number() from that, every row of a streak ends up with the same value (grp) per pid, and a gap in the dates starts a new value. Then it's simple to number the rows within each group.
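
    To see why this works, it can help to look at the intermediate values. The sketch below runs the inner step on the rows for pid 1 from the question, supplied inline with VALUES so it needs no table. The column names day_no and rn and the alias t are only illustrative.

    SELECT pid, the_date
         , the_date - '2000-01-01'::date AS day_no  -- days since an arbitrary fixed date
         , row_number() OVER (PARTITION BY pid ORDER BY the_date) AS rn
         , the_date - '2000-01-01'::date
         - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
    FROM  (
       VALUES (1, '2014-10-01'::date)
            , (1, '2014-10-02')
            , (1, '2014-10-03')
            , (1, '2014-10-05')
            , (1, '2014-10-07')
       ) t(pid, the_date)
    ORDER  BY pid, the_date;

    pid | the_date   | day_no | rn | grp
    ----|------------|--------|----|-----
    1   | 2014-10-01 | 5387   | 1  | 5386
    1   | 2014-10-02 | 5388   | 2  | 5386
    1   | 2014-10-03 | 5389   | 3  | 5386
    1   | 2014-10-05 | 5391   | 4  | 5387
    1   | 2014-10-07 | 5393   | 5  | 5388

    Within a streak, day_no and rn grow in lockstep, so their difference stays constant. After a gap, day_no jumps by more than one while rn does not, so grp changes.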

    grp is calculated with two subtractions, which should be fastest. An equally fast alternative could be:

    the_date - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp
    

    One multiplication, one subtraction. String concatenation and casting are more expensive. Test with EXPLAIN ANALYZE.
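
    A sketch of how the two variants could be compared, assuming tbl holds a realistic number of rows. Run each statement a few times and compare the reported execution times. Note that the interval variant makes grp a timestamp instead of an integer, which makes no difference for grouping.

    -- Variant 1: two subtractions, grp is an integer
    EXPLAIN ANALYZE
    SELECT pid, the_date
         , row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
    FROM  (
       SELECT *
            , the_date - '2000-01-01'::date
            - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
       FROM   tbl
    ) sub;

    -- Variant 2: one multiplication, one subtraction, grp is a timestamp
    EXPLAIN ANALYZE
    SELECT pid, the_date
         , row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
    FROM  (
       SELECT *
            , the_date - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp
       FROM   tbl
    ) sub;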

    Don't forget to partition by pid additionally in both steps, or you'll inadvertently mix groups that should be separated.

    I use a subquery because that is typically faster than a CTE, and there is nothing here that a plain subquery couldn't do.
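
    For comparison, the same logic written as a CTE might look like the sketch below (the name cte is arbitrary). It returns the same rows. In PostgreSQL 12 or later, a CTE like this one, which is not recursive, has no side effects and is referenced only once, is normally inlined by the planner, so the difference matters mostly on older versions.

    WITH cte AS (
       SELECT *
            , the_date - '2000-01-01'::date
            - row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
       FROM   tbl
       )
    SELECT pid, the_date
         , row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
    FROM   cte
    ORDER  BY pid, the_date;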

    And since you mentioned it: dense_rank() is obviously not necessary here. Basic row_number() does the job.