Search code examples
rdata.tablegroupinglag

Data table shift by group error


I have a very large data set where I want to to take the lagged value of a row for the same by group. This functionality is well documented in the data.table package and I've seen other examples of it. However with my data I keep getting an error and I cannot figure out why.

According to documentation, this is the process. I can run this and it works.

library(data.table)
DT = data.table(year=rep(2010:2011, each=3), v1=1:6)
DT[, c("lag1", "lag2") := shift(.SD, 1:2), by=year]

Now here is a sample of my data.

unique_id click_time
100005361 2017-11-09 03:58:32
100005371 2017-11-09 00:53:08
100005371 2017-11-09 04:38:52
100005371 2017-11-09 05:42:30
100005371 2017-11-09 05:42:31
100005371 2017-11-09 05:42:31
100005371 2017-11-09 05:42:32
100005371 2017-11-09 05:42:38
100005371 2017-11-09 05:42:51
100005371 2017-11-09 09:46:57
100005371 2017-11-09 09:46:58
100005371 2017-11-09 09:46:58
100005371 2017-11-09 09:47:00
100005371 2017-11-09 11:37:26
100005391 2017-11-09 05:04:36
100005391 2017-11-09 05:04:41
10000541 2017-11-09 10:55:35
100005411 2017-11-08 22:55:49
100005411 2017-11-08 22:56:14

There are 5 unique ids here. I want to use the lag of click time and join it to each row where applicable. Obviously if there's just 1 row or it's the 1st row then there wont be any lag. Applying the above code should be a one liner. This what I tried.

DT[, c("last_click_time") := shift(.SD), by=unique_id]

This is following the syntax from the documentation and from links I've seen elsewhere such as here: How to create a lag variable within each group?

I get an error though that I can't figure out:

Error in `[.data.frame`(DT, , `:=`(c("last_click_time"), shift(.SD)),  : 
  unused argument (by = unique_id)

Even if I replace the timestamps to something simpler, the error persists.

DT$click_time = seq(1,2000,length.out = 19)

Any idea why this error is popping up? This is driving me mad!

EDIT: The answer to this question is I am an idiot. If you try this on a data.frame() object you will get the error I showed. If you convert to a data.table() then the code works as it should. Embarrassing but hopefully this post will be useful to someone else later on.


Solution

  • Your example seems to work for me, although I had to tidy the data a little to make sure it reads in correctly:

    library( data.table )
    
    DT <- fread( 'unique_id,click_time
    100005361,2017-11-09 03:58:32
    100005371,2017-11-09 00:53:08
    100005371,2017-11-09 04:38:52
    100005371,2017-11-09 05:42:30
    100005371,2017-11-09 05:42:31
    100005371,2017-11-09 05:42:31
    100005371,2017-11-09 05:42:32
    100005371,2017-11-09 05:42:38
    100005371,2017-11-09 05:42:51
    100005371,2017-11-09 09:46:57
    100005371,2017-11-09 09:46:58
    100005371,2017-11-09 09:46:58
    100005371,2017-11-09 09:47:00
    100005371,2017-11-09 11:37:26
    100005391,2017-11-09 05:04:36
    100005391,2017-11-09 05:04:41
    10000541,2017-11-09 10:55:35
    100005411,2017-11-08 22:55:49
    100005411,2017-11-08 22:56:14' )
    
    DT[, c("last_click_time") := shift(.SD), by=unique_id]
    

    The result:

    > head( DT )
    
       unique_id          click_time     last_click_time
    1: 100005361 2017-11-09 03:58:32                  NA
    2: 100005371 2017-11-09 00:53:08                  NA
    3: 100005371 2017-11-09 04:38:52 2017-11-09 00:53:08
    4: 100005371 2017-11-09 05:42:30 2017-11-09 04:38:52
    5: 100005371 2017-11-09 05:42:31 2017-11-09 05:42:30
    6: 100005371 2017-11-09 05:42:31 2017-11-09 05:42:31
    

    I'm note sure whether it's best practice or not, but I personally avoid using .SD unless it's really necessary. Here you can go without. I'd use this, which should give an identical result to the above:

    DT[, last_click_time := shift(click_time), by=unique_id]