Search code examples
rdata.tabledplyrdistinct-values

R: calculate number of distinct categories in the specified time frame


here's some dummy data:

  user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple

I'd like to calculate for each user_id the number of distinct categories in the specified time period (e.g. in the past 7, 14 days), including the current order

The solution would look like this:

 user_id       date category distinct_7 distinct_14
      27 2016-01-01    apple          1           1
      27 2016-01-03    apple          1           1
      27 2016-01-05     pear          2           2
      27 2016-01-07     plum          3           3
      27 2016-01-10    apple          3           3
      27 2016-01-14     pear          3           3
      27 2016-01-16     plum          3           3
      11 2016-01-01    apple          1           1
      11 2016-01-03     pear          2           2
      11 2016-01-05     pear          2           2
      11 2016-01-07     pear          2           2
      11 2016-01-10    apple          2           2
      11 2016-01-14    apple          2           2
      11 2016-01-16    apple          1           2

I posted similar questions here or here, however none of it referred to counting cumulative unique values for the specified time period. Thanks a lot for your help!


Solution

  • In the tidyverse, you can use map_int to iterate over a set of values and simplify to an integer à la sapply or vapply. Count distinct occurrences with n_distinct (like length(unique(...))) of an object subset by comparisons or the helper between, with a minimum set by the appropriate amount subtracted from that day, and you're set.

    library(tidyverse)
    
    df %>% group_by(user_id) %>% 
        mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])), 
               distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))
    
    ## Source: local data frame [14 x 5]
    ## Groups: user_id [2]
    ## 
    ##    user_id       date category distinct_7 distinct_14
    ##      <int>     <date>   <fctr>      <int>       <int>
    ## 1       27 2016-01-01    apple          1           1
    ## 2       27 2016-01-03    apple          1           1
    ## 3       27 2016-01-05     pear          2           2
    ## 4       27 2016-01-07     plum          3           3
    ## 5       27 2016-01-10    apple          3           3
    ## 6       27 2016-01-14     pear          3           3
    ## 7       27 2016-01-16     plum          3           3
    ## 8       11 2016-01-01    apple          1           1
    ## 9       11 2016-01-03     pear          2           2
    ## 10      11 2016-01-05     pear          2           2
    ## 11      11 2016-01-07     pear          2           2
    ## 12      11 2016-01-10    apple          2           2
    ## 13      11 2016-01-14    apple          2           2
    ## 14      11 2016-01-16    apple          1           2