Dot preceding parentheses, .( ), in data.table


I am not familiar with this df[, .(...), Col] notation. I apologize if I am missing something obvious, but I cannot find a reference to this notation style, though it looks very useful.

It appears to be implementing aggregation. Based on where this notation sits in the code below, I would expect it to come from R, not from h2o, but I have checked both to no avail.

The example is from a Kaggle competition and the code works (to reproduce it go here):

trainHex <- as.h2o(train[, .(
  dist         = mean(radardist_km, na.rm = TRUE),
  refArea5     = mean(Ref_5x5_50th, na.rm = TRUE),
  refArea9     = mean(Ref_5x5_90th, na.rm = TRUE),
  meanRefcomp  = mean(RefComposite, na.rm = TRUE),
  meanRefcomp5 = mean(RefComposite_5x5_50th, na.rm = TRUE),
  meanRefcomp9 = mean(RefComposite_5x5_90th, na.rm = TRUE),
  zdr          = mean(Zdr, na.rm = TRUE),
  zdr5         = mean(Zdr_5x5_50th, na.rm = TRUE),
  zdr9         = mean(Zdr_5x5_90th, na.rm = TRUE),
  target       = log1p(mean(Expected)),
  meanRef      = mean(Ref, na.rm = TRUE),
  sumRef       = sum(Ref, na.rm = TRUE),
  records      = .N,
  naCounts     = sum(is.na(Ref))
), Id][records > naCounts, ], destination_frame = "train.hex")

I would love the documentation and/or a good explanation of this.


Solution

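  • .() is a data.table convenience notation: a terse alias for list(). What complicates matters a little (mostly for those, like you, trying to figure out what the heck that . does!) is that it is only interpreted this way inside a call to [.data.table(), that is, inside the square brackets of a data.table. In your code, train[, .(...), Id] therefore computes the listed summaries once per value of Id, which is the aggregation you suspected.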

    Here, from ?data.table:

     DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
     setkey(DT,x,y)             # 2-column key
    
     DT["a"]                    # join to 1st column of key
     DT[.("a")]                 # same, .() is an alias for list()
     DT[list("a")]              # same
    
     ## But note that *this* doesn't work (my addition --- not in ?data.table)
     .("a")
    

    See also the vignette Introduction to data.table:

    data.table also allows wrapping columns with .() instead of list(). It is an alias to list(); they both mean the same. Feel free to use whichever you prefer.
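
To tie this back to the code in the question, here is a minimal sketch of the same pattern on toy data. The table and its values are made up for illustration; only the column names Id, Ref and Expected are borrowed from the question. It computes the same grouped summary once with .() and once with list(), then applies the chained filter:

```r
library(data.table)

# Hypothetical toy data standing in for the Kaggle radar table
train <- data.table(
  Id       = c(1L, 1L, 2L, 2L, 3L),
  Ref      = c(10, NA, NA, NA, 30),
  Expected = c(1, 1, 0, 0, 2)
)

# j written with .(): one row of summaries per Id
agg_dot <- train[, .(
  meanRef  = mean(Ref, na.rm = TRUE),
  target   = log1p(mean(Expected)),
  records  = .N,                 # .N is the number of rows in the group
  naCounts = sum(is.na(Ref))
), Id]

# The same j written with list()
agg_list <- train[, list(
  meanRef  = mean(Ref, na.rm = TRUE),
  target   = log1p(mean(Expected)),
  records  = .N,
  naCounts = sum(is.na(Ref))
), Id]

# Compare the two results
all.equal(agg_dot, agg_list)

# Chaining a second [ ] filters the aggregated table: keep only the
# groups that have at least one non-missing Ref value
kept <- agg_dot[records > naCounts, ]
kept
```

In this toy table every Ref value for Id 2 is missing, so records equals naCounts for that group and the chained filter drops it, just as [records>naCounts,] does in the original code.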