I have a problems that i've been wrestling with withs some time. I have a messy data frame I've managed to re-arrange into this format.
user_id date n
user1 1-1-15 60
user1 1-2-15 25
user1 1-5-15 20
user2 1-2-15 23
user2 1-4-15 15
user3 2-12-15 9
... ... ...
The user start and end dates don't all sync up and some aren't continuous. Ideally I'd like to find a percent change for each user over their lifetime. Something like:
user_id delta
user1 -%15
user2 -%27
user3 0%
... ...
any help would be awesome.
As discussed in the comments, the dplyr
option to find the change of last
versus first
'n', grouped by the 'user_id' would be
library(dplyr)
df1 %>%
group_by(user_id) %>%
summarise(delta= 10*first(n)/(last(n)-first(n)))
Similar option using data.table
will be to convert the 'data.frame' to 'data.table' (setDT(df1)
), group by the 'user_id', we create the 'delta' column by calculating the change of last 'n' (n[.N]
) with respect to first (n[1L]
). We can also have a if/else
to condition to return the calculated value if the number of elements within the group is greater than 1 (.N >1
) or else
to return 0.
library(data.table)
setDT(df1)[, list(delta=if(.N>1) 10*n[1L]/(n[.N]-n[1L]) else 0), by = user_id]
Suppose if we need to do the lm
for each 'user_id', we can get the slope
within the do
environment. Just like in the previous case, we can create an if/else
condition to return slope value if the number of elements is greater than 1 or else to return 0.
df1 %>%
group_by(user_id) %>%
mutate(N= n()) %>%
do(data.frame(slope= if(.$N[1L] >1) coef(lm(.$n~.$date))[2] else 0))
Or using data.table
, we get the coef
after grouping by 'user_id'
setDT(df1)[, if(.N>1) coef(lm(n~date))[2] else 0 , user_id]
df1 <- structure(list(user_id = c("user1", "user1", "user1", "user2",
"user2", "user3"), date = c("1-1-15", "1-2-15", "1-5-15", "1-2-15",
"1-4-15", "2-12-15"), n = c(60L, 25L, 20L, 23L, 15L, 9L)),
.Names = c("user_id",
"date", "n"), class = "data.frame", row.names = c(NA, -6L))