I have a very large data set where I want to to take the lagged value of a row for the same by group. This functionality is well documented in the data.table package and I've seen other examples of it. However with my data I keep getting an error and I cannot figure out why.
According to documentation, this is the process. I can run this and it works.
library(data.table)
DT = data.table(year=rep(2010:2011, each=3), v1=1:6)
DT[, c("lag1", "lag2") := shift(.SD, 1:2), by=year]
Now here is a sample of my data.
unique_id click_time
100005361 2017-11-09 03:58:32
100005371 2017-11-09 00:53:08
100005371 2017-11-09 04:38:52
100005371 2017-11-09 05:42:30
100005371 2017-11-09 05:42:31
100005371 2017-11-09 05:42:31
100005371 2017-11-09 05:42:32
100005371 2017-11-09 05:42:38
100005371 2017-11-09 05:42:51
100005371 2017-11-09 09:46:57
100005371 2017-11-09 09:46:58
100005371 2017-11-09 09:46:58
100005371 2017-11-09 09:47:00
100005371 2017-11-09 11:37:26
100005391 2017-11-09 05:04:36
100005391 2017-11-09 05:04:41
10000541 2017-11-09 10:55:35
100005411 2017-11-08 22:55:49
100005411 2017-11-08 22:56:14
There are 5 unique ids here. I want to use the lag of click time and join it to each row where applicable. Obviously if there's just 1 row or it's the 1st row then there wont be any lag. Applying the above code should be a one liner. This what I tried.
DT[, c("last_click_time") := shift(.SD), by=unique_id]
This is following the syntax from the documentation and from links I've seen elsewhere such as here: How to create a lag variable within each group?
I get an error though that I can't figure out:
Error in `[.data.frame`(DT, , `:=`(c("last_click_time"), shift(.SD)), :
unused argument (by = unique_id)
Even if I replace the timestamps to something simpler, the error persists.
DT$click_time = seq(1,2000,length.out = 19)
Any idea why this error is popping up? This is driving me mad!
EDIT: The answer to this question is I am an idiot. If you try this on a data.frame() object you will get the error I showed. If you convert to a data.table() then the code works as it should. Embarrassing but hopefully this post will be useful to someone else later on.
Your example seems to work for me, although I had to tidy the data a little to make sure it reads in correctly:
library( data.table )
DT <- fread( 'unique_id,click_time
100005361,2017-11-09 03:58:32
100005371,2017-11-09 00:53:08
100005371,2017-11-09 04:38:52
100005371,2017-11-09 05:42:30
100005371,2017-11-09 05:42:31
100005371,2017-11-09 05:42:31
100005371,2017-11-09 05:42:32
100005371,2017-11-09 05:42:38
100005371,2017-11-09 05:42:51
100005371,2017-11-09 09:46:57
100005371,2017-11-09 09:46:58
100005371,2017-11-09 09:46:58
100005371,2017-11-09 09:47:00
100005371,2017-11-09 11:37:26
100005391,2017-11-09 05:04:36
100005391,2017-11-09 05:04:41
10000541,2017-11-09 10:55:35
100005411,2017-11-08 22:55:49
100005411,2017-11-08 22:56:14' )
DT[, c("last_click_time") := shift(.SD), by=unique_id]
The result:
> head( DT )
unique_id click_time last_click_time
1: 100005361 2017-11-09 03:58:32 NA
2: 100005371 2017-11-09 00:53:08 NA
3: 100005371 2017-11-09 04:38:52 2017-11-09 00:53:08
4: 100005371 2017-11-09 05:42:30 2017-11-09 04:38:52
5: 100005371 2017-11-09 05:42:31 2017-11-09 05:42:30
6: 100005371 2017-11-09 05:42:31 2017-11-09 05:42:31
I'm note sure whether it's best practice or not, but I personally avoid using .SD
unless it's really necessary. Here you can go without. I'd use this, which should give an identical result to the above:
DT[, last_click_time := shift(click_time), by=unique_id]