Search code examples
rdata-analysis

Looking at Percent Change of Count per user


I have a problems that i've been wrestling with withs some time. I have a messy data frame I've managed to re-arrange into this format.

user_id  date     n
user1    1-1-15   60
user1    1-2-15   25
user1    1-5-15   20
user2    1-2-15   23
user2    1-4-15   15
user3    2-12-15  9
...      ...      ...

The user start and end dates don't all sync up and some aren't continuous. Ideally I'd like to find a percent change for each user over their lifetime. Something like:

user_id delta
user1   -%15
user2   -%27
user3    0%
...      ...

any help would be awesome.


Solution

  • As discussed in the comments, the dplyr option to find the change of last versus first 'n', grouped by the 'user_id' would be

    library(dplyr)
    df1 %>%
      group_by(user_id) %>%
      summarise(delta= 10*first(n)/(last(n)-first(n))) 
    

    Similar option using data.table will be to convert the 'data.frame' to 'data.table' (setDT(df1)), group by the 'user_id', we create the 'delta' column by calculating the change of last 'n' (n[.N]) with respect to first (n[1L]). We can also have a if/else to condition to return the calculated value if the number of elements within the group is greater than 1 (.N >1) or else to return 0.

    library(data.table)
    setDT(df1)[, list(delta=if(.N>1) 10*n[1L]/(n[.N]-n[1L]) else 0), by = user_id]
    

    Suppose if we need to do the lm for each 'user_id', we can get the slope within the do environment. Just like in the previous case, we can create an if/else condition to return slope value if the number of elements is greater than 1 or else to return 0.

    df1 %>%
       group_by(user_id) %>%
       mutate(N= n()) %>%
       do(data.frame(slope= if(.$N[1L] >1) coef(lm(.$n~.$date))[2] else 0))
    

    Or using data.table, we get the coef after grouping by 'user_id'

    setDT(df1)[, if(.N>1) coef(lm(n~date))[2] else 0 , user_id]
    

    data

    df1 <- structure(list(user_id = c("user1", "user1", "user1", "user2", 
    "user2", "user3"), date = c("1-1-15", "1-2-15", "1-5-15", "1-2-15", 
    "1-4-15", "2-12-15"), n = c(60L, 25L, 20L, 23L, 15L, 9L)),
    .Names = c("user_id", 
    "date", "n"), class = "data.frame", row.names = c(NA, -6L))