I am trying to specify sessions in a click stream data. I group rows based on months and userId and try to create another variable session, that looks at diff_days column, and increase by on if thats > 0.00209 and stays as the previous value otherwise. So basically I am trying to create session variable and use the lag version on it at the same time. The fist row in a group is always session = 1.
So take for example this data is one of the groups from group_by:
ID Month diff_days
2 0 NA
2 0 0.0002
2 0 0.001
2 0 0.01
2 0 0.00034
2 0 0.1
2 0 0.3
2 0 0.00005
and I want to create session variable within each group like this:
ID Month diff_days session
2 0 NA 1
2 0 0.0002 1
2 0 0.001 1
2 0 0.01 2
2 0 0.00034 2
2 0 0.1 3
2 0 0.3 4
2 0 0.00005 4
The code that I am using and not giving the right answer:
data <- data %>% group_by(ID, Month)
%>% mutate(session = ifelse(row_number() == 1, 1 ,
ifelse(diff_days < 0.0209, lag(session) , lag(session) + 1))) %>% ungroup()
I have been struggling with this for quite some time so any help would be greatly appreciated.
Thanks!
We can use cumsum
on the logical vector after grouping by 'ID', 'Month'. Create a logical vector diff_days[-1] >= 0.00209
(removed the first observation which is NA
and appended TRUE
as the first one. Then, get the cumulative sum, so that for every TRUE value, it gets added 1.
data %>%
group_by(ID, Month) %>%
mutate(session = cumsum(c(TRUE, diff_days[-1] >= 0.00209)))
# A tibble: 8 x 4
# Groups: ID, Month [1]
# ID Month diff_days session
# <int> <int> <dbl> <int>
#1 2 0 NA 1
#2 2 0 0.0002 1
#3 2 0 0.001 1
#4 2 0 0.01 2
#5 2 0 0.00034 2
#6 2 0 0.1 3
#7 2 0 0.3 4
#8 2 0 0.00005 4