Calculating simple retention in R


For the dataset test, my objective is to find out how many unique users carried over from one period to the next on a period-by-period basis.

> test
   user_id period
1        1      1
2        5      1
3        1      1
4        3      1
5        4      1
6        2      2
7        3      2
8        2      2
9        3      2
10       1      2
11       5      3
12       5      3
13       2      3
14       1      3
15       4      3
16       5      4
17       5      4
18       5      4
19       4      4
20       3      4

For example, in the first period there were four unique users (1, 3, 4, and 5), two of whom (1 and 3) were active in the second period, so the retention rate reported for period 2 would be 0.5. In the second period there were three unique users (1, 2, and 3), two of whom were active in the third period, so the retention rate for period 3 would be 0.666, and so on. The first period has no preceding period, so its retention is NA. How would one find the proportion of each period's unique users who are active in the following period? Any suggestions would be appreciated.

The desired output would be the following:

> output
  period retention
1      1        NA
2      2     0.500
3      3     0.666
4      4     0.500

The test data:

> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5, 
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")

Solution

  • This isn't so elegant, but it seems to work. Assuming df is the data frame (test in the question) and the periods are numbered consecutively from 1:

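    # make a list to hold the unique user IDs in each period
    n_periods <- max(df$period)
    uniques <- list()
    for (i in seq_len(n_periods)) {
      uniques[[i]] <- unique(df$user_id[df$period == i])
    }
    
    # hold the retention rates; the first period has no predecessor, so it stays NA
    retentions <- rep(NA_real_, times = n_periods)
    
    # retention for period j: share of period j-1 users also active in period j
    # (seq_len(n)[-1] is empty when there is only one period, unlike 2:n)
    for (j in seq_len(n_periods)[-1]) {
      retentions[j] <- mean(uniques[[j - 1]] %in% uniques[[j]])
    }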
    

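    Basically, %in% creates a logical vector indicating whether each element of the first argument appears in the second. Taking the mean of that vector gives the proportion of TRUE values, i.e. the share of the previous period's users who are still active.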
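
To get the result in the shape the question asks for, the same idea can be wrapped in a function that returns a data frame. The following is only a sketch, not a definitive implementation: the function name retention_by_period is hypothetical, the data is rebuilt by hand from the question's dput() output, and it assumes that "the previous period" means the previous period value present in the data, in sorted order.

```r
# Test data rebuilt from the question (period is 1..4, five rows each)
test <- data.frame(
  user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5, 2, 1, 4, 5, 5, 5, 4, 3),
  period  = rep(1:4, each = 5)
)

# Hypothetical helper: retention of each period relative to the one before it
retention_by_period <- function(df) {
  # split() groups user IDs by period (sorted); unique() drops repeat visits
  users   <- lapply(split(df$user_id, df$period), unique)
  periods <- as.numeric(names(users))
  n       <- length(users)

  # The first period has nothing to compare against, so it stays NA
  rates <- rep(NA_real_, n)
  for (j in seq_len(n)[-1]) {
    # Share of the previous period's users who show up again in period j
    rates[j] <- mean(users[[j - 1]] %in% users[[j]])
  }

  data.frame(period = periods, retention = rates)
}

output <- retention_by_period(test)
print(output)
# retention column: NA, 0.5, 2/3, 0.5
```

Using split() instead of indexing by 1:max(period) means the periods do not have to start at 1, but note that a missing period is then silently skipped rather than producing a gap in the output.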