Search code examples
rrolling-computationdbplyr

rolling count of distinct users


I want to calculate the rolling count of unique users with variable time windows. Here's an example of what I have and the outcome I want.

have <- data.frame(user = c(1, 2, 
                            2, 3, 
                            1, 2, 3, 
                            4, 
                            3, 4,
                            4),
                   when = lubridate::ymd("2020-01-01",
                                         "2020-01-01",
                                         "2020-01-02",
                                         "2020-01-02",
                                         "2020-01-03",
                                         "2020-01-03",
                                         "2020-01-03",
                                         "2020-01-05",
                                         "2020-01-06",
                                         "2020-01-06",
                                         "2020-01-07"))
have 
#   user       when
#1     1 2020-01-01
#2     2 2020-01-01
#3     2 2020-01-02
#4     3 2020-01-02
#5     1 2020-01-03
#6     2 2020-01-03
#7     3 2020-01-03 # note that Jan 4 is missing
#8     4 2020-01-05
#9     3 2020-01-06
#10    4 2020-01-06
#11    4 2020-01-07

want <- data.frame(when=c("2020-01-01",
                          "2020-01-02",
                          "2020-01-03",
                          "2020-01-04",
                          "2020-01-05",
                          "2020-01-06",
                          "2020-01-07"),
                   twoDayCount=c(2, # Jan 1: 1, 2
                                 3, # Jan 1-2: 1, 2, 3
                                 3, # Jan 2-3: 1, 2, 3
                                 3, # Jan 3-4: 1, 2, 3
                                 1, # Jan 4-5: 4
                                 2, # Jan 5-6: 3, 4
                                 2  # Jan 6-7: 3, 4
                                 )
                   )
want
#        when twoDayCount
#1 2020-01-01           2 # users: 1, 2
#2 2020-01-02           3 # users: 1, 2, 3
#3 2020-01-03           3 # users: 1, 2, 3
#4 2020-01-04           3 # users: 1, 2, 3
#5 2020-01-05           1 # users: 4
#6 2020-01-06           2 # users: 3, 4
#7 2020-01-07           2 # users: 3, 4

I've tried a few approaches but they have me counting all rows per window, not distinct users per window. For instance, the desired 2-day unique user count on Jan 3 is 3 (users 1, 2, 3), not 5 rows (with users 2 and 3 appearing twice each).

My actual use case needs the rolling window period (2 days in this example) to be an input.

Ideally the solution works with functions that {dbplyr} can translate to sql or via native sql that can be run with {dbplyr}.

This answer gives an idea for how to solve with sql:

SELECT when, count(DISTINCT user) AS dist_users 
FROM  (SELECT generate_series('2020-01-01'::date, '2020-01-07'::date, '1d')::date) AS g(when) 
LEFT   JOIN tbl t ON t.when BETWEEN g.when - 2 AND g.when 
GROUP  BY 1 
ORDER  BY 1;

Solution

  • Using functions from dplyr and tidyr, for the 1-day window case:

    have %>% 
      group_by(when) %>% 
      summarise(twoDayCount = n_distinct(user))
    

    For larger windows:

    window <- 2
    have %>% 
      rowwise() %>% 
      mutate(when = list(when + lubridate::days(0:(window - 1)))) %>% 
      unnest(cols = when) %>%
      group_by(when) %>% 
      summarise(twoDayCount = n_distinct(user))
    

    Note that this method will give you rows for a few later dates (in this case Jan 08), which you might want to remove.

    If performance is an issue for larger datasets, here is a much faster (but slightly less elegant) solution:

    window <- 2
    seq.Date(min(have$when), max(have$when), by = "day") %>% 
      purrr::map(function(date) {
        have %>% 
            filter(when <= date, when >= date - days(window - 1))  %>%
            summarise(userCount = n_distinct(user)) %>%
            mutate(when = date)
        }) %>% 
      bind_rows()