I have a dataset that looks like this:
respondent_id day_session daydiff
nmo876 11/19/2017 0
nmo876 11/20/2017 1
nmo876 11/21/2017 1
nmo876 11/23/2017 2
nmo876 11/24/2017 1
nmo876 11/25/2017 1
nmo876 11/26/2017 1
nmo876 11/27/2017 1
nmo876 11/28/2017 1
nmo876 11/29/2017 1
nmo876 11/30/2017 1
nmo876 12/1/2017 1
nmo876 12/2/2017 1
nmo876 12/3/2017 1
nmo876 12/4/2017 1
nmo876 12/5/2017 1
nmo876 12/6/2017 1
nmo876 12/7/2017 1
nmo876 12/8/2017 1
nmo876 12/9/2017 1
nmo876 12/10/2017 1
nmo876 12/11/2017 1
nmo876 12/12/2017 1
nmo876 12/13/2017 1
nmo876 12/14/2017 1
nmo876 12/15/2017 1
nmo876 12/16/2017 1
nmo876 12/17/2017 1
nmo876 12/18/2017 1
nmo876 12/19/2017 1
nmo876 12/20/2017 1
nmo876 12/23/2017 3
nmo876 12/24/2017 1
nmo876 12/26/2017 2
nmo876 12/27/2017 1
nmo876 12/28/2017 1
nmo876 12/29/2017 1
nmo876 12/30/2017 1
nmo876 12/31/2017 1
nmo876 1/2/2018 2
nmo876 1/3/2018 1
nmo876 1/4/2018 1
nmo876 1/5/2018 1
I want to write a script that chooses the largest chunk of consecutive day_sessions from a dataset where a user might have multiple consecutive day_sessions, i.e. where daydiff = 1. For nmo876 the output would be 27.
Here is more data for which the code should calculate the largest chunk of consecutive daily sessions. For user jkl567 the output would be 37:
jkl567 11/19/2017 1
jkl567 11/20/2017 1
jkl567 11/21/2017 1
jkl567 11/22/2017 1
jkl567 11/23/2017 1
jkl567 11/24/2017 1
jkl567 11/25/2017 1
jkl567 11/26/2017 1
jkl567 11/27/2017 1
jkl567 11/28/2017 1
jkl567 11/29/2017 1
jkl567 11/30/2017 1
jkl567 12/1/2017 1
jkl567 12/2/2017 1
jkl567 12/3/2017 1
jkl567 12/4/2017 1
jkl567 12/5/2017 1
jkl567 12/6/2017 1
jkl567 12/7/2017 1
jkl567 12/8/2017 1
jkl567 12/9/2017 1
jkl567 12/10/2017 1
jkl567 12/11/2017 1
jkl567 12/12/2017 1
jkl567 12/13/2017 1
jkl567 12/14/2017 1
jkl567 12/15/2017 1
jkl567 12/16/2017 1
jkl567 12/17/2017 1
jkl567 12/18/2017 1
jkl567 12/19/2017 1
jkl567 12/20/2017 1
jkl567 12/21/2017 1
jkl567 12/22/2017 1
jkl567 12/23/2017 1
jkl567 12/24/2017 1
jkl567 12/25/2017 1
jkl567 12/26/2017 2
jkl567 12/28/2017 1
jkl567 12/29/2017 3
jkl567 1/1/2018 1
jkl567 1/2/2018 1
jkl567 1/3/2018 1
jkl567 1/4/2018 1
You can subtract row_number()
to get a constant value that defines the group. To get the length of each group:
select respondent_id, (day_session - seqnum * interval '1 day') as grp, count(*) as days_in_row
from (select t.*,
row_number() over (partition by respondent_id order by day_session) as seqnum
from t
) t
group by respondent_id, (day_session - seqnum * interval '1 day');
You can then get the biggest for each respondent using distinct on
. I use a subquery for this:
select distinct on (respondent_id) t.*
from (select respondent_id, (day_session - seqnum * interval '1 day') as grp, count(*) as days_in_row
from (select t.*,
row_number() over (partition by respondent_id order by day_session) as seqnum
from t
) t
group by respondent_id, (day_session - seqnum * interval '1 day')
) t
order by respondent_id, days_in_row desc;
Strictly speaking, the subquery is not necessary. I just find it easier to break out the logic that way.