Search code examples
sqldatabasepostgresqlranking

Return Largest Chunk of Consecutive Days Per User in PostgreSQL


I have a dataset that looks like this:

    respondent_id   day_session daydiff
nmo876  11/19/2017  0
nmo876  11/20/2017  1
nmo876  11/21/2017  1
nmo876  11/23/2017  2
nmo876  11/24/2017  1
nmo876  11/25/2017  1
nmo876  11/26/2017  1
nmo876  11/27/2017  1
nmo876  11/28/2017  1
nmo876  11/29/2017  1
nmo876  11/30/2017  1
nmo876  12/1/2017   1
nmo876  12/2/2017   1
nmo876  12/3/2017   1
nmo876  12/4/2017   1
nmo876  12/5/2017   1
nmo876  12/6/2017   1
nmo876  12/7/2017   1
nmo876  12/8/2017   1
nmo876  12/9/2017   1
nmo876  12/10/2017  1
nmo876  12/11/2017  1
nmo876  12/12/2017  1
nmo876  12/13/2017  1
nmo876  12/14/2017  1
nmo876  12/15/2017  1
nmo876  12/16/2017  1
nmo876  12/17/2017  1
nmo876  12/18/2017  1
nmo876  12/19/2017  1
nmo876  12/20/2017  1
nmo876  12/23/2017  3
nmo876  12/24/2017  1
nmo876  12/26/2017  2
nmo876  12/27/2017  1
nmo876  12/28/2017  1
nmo876  12/29/2017  1
nmo876  12/30/2017  1
nmo876  12/31/2017  1
nmo876  1/2/2018    2
nmo876  1/3/2018    1
nmo876  1/4/2018    1
nmo876  1/5/2018    1

I want to write a script that chooses the largest chunk of consecutive day_sessions from a dataset where a user might have multiple consecutive day_sessions, i.e. where daydiff = 1. For nmo876 the output would be 27.

Here is more data for which the code should calculate the largest chunk of consecutive daily sessions. For user jkl567 the output would be 37:

jkl567  11/19/2017  1
jkl567  11/20/2017  1
jkl567  11/21/2017  1
jkl567  11/22/2017  1
jkl567  11/23/2017  1
jkl567  11/24/2017  1
jkl567  11/25/2017  1
jkl567  11/26/2017  1
jkl567  11/27/2017  1
jkl567  11/28/2017  1
jkl567  11/29/2017  1
jkl567  11/30/2017  1
jkl567  12/1/2017   1
jkl567  12/2/2017   1
jkl567  12/3/2017   1
jkl567  12/4/2017   1
jkl567  12/5/2017   1
jkl567  12/6/2017   1
jkl567  12/7/2017   1
jkl567  12/8/2017   1
jkl567  12/9/2017   1
jkl567  12/10/2017  1
jkl567  12/11/2017  1
jkl567  12/12/2017  1
jkl567  12/13/2017  1
jkl567  12/14/2017  1
jkl567  12/15/2017  1
jkl567  12/16/2017  1
jkl567  12/17/2017  1
jkl567  12/18/2017  1
jkl567  12/19/2017  1
jkl567  12/20/2017  1
jkl567  12/21/2017  1
jkl567  12/22/2017  1
jkl567  12/23/2017  1
jkl567  12/24/2017  1
jkl567  12/25/2017  1
jkl567  12/26/2017  2
jkl567  12/28/2017  1
jkl567  12/29/2017  3
jkl567  1/1/2018    1
jkl567  1/2/2018    1
jkl567  1/3/2018    1
jkl567  1/4/2018    1

Solution

  • You can subtract row_number() to get a constant value that defines the group. To get the length of each group:

    select respondent_id, (day_session - seqnum * interval '1 day') as grp, count(*) as days_in_row
    from (select t.*,
                 row_number() over (partition by respondent_id order by day_session) as seqnum
          from t
         ) t
    group by respondent_id, (day_session - seqnum * interval '1 day');
    

    You can then get the biggest for each respondent using distinct on. I use a subquery for this:

    select distinct on (respondent_id) t.*
    from (select respondent_id, (day_session - seqnum * interval '1 day') as grp, count(*) as days_in_row
          from (select t.*,
                       row_number() over (partition by respondent_id order by day_session) as seqnum
                from t
               ) t
          group by respondent_id, (day_session - seqnum * interval '1 day')
         ) t
    order by respondent_id, days_in_row desc;
    

    Strictly speaking, the subquery is not necessary. I just find it easier to break out the logic that way.