Search code examples
rdata.tabledynamic-columns

R: data.table .dynamic aggregations on column Date columns


I am trying to do a min/max aggregate on a dynamically chosen column in a data.table. It works perfectly for numeric columns but I cannot get it to work on Date columns unless I create a temporary data.table.

It works when I use the name:

dt <- data.table(Index=1:31, Date = seq(as.Date('2015-01-01'), as.Date('2015-01-31'), by='days'))
dt[, .(minValue = min(Date), maxValue = max(Date))]
# minValue   maxValue
# 1: 2015-01-01 2015-01-31

It does not work when I use with=FALSE:

colName = 'Date'
dt[, .(minValue = min(colName), maxValue = max(colName)), with=F]
# Error in `[.data.table`(dt, , .(minValue = min(colName), maxValue = max(colName)),  : 
# could not find function "."

I can use .SDcols on a numeric column:

colName = 'Index'
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
#   minValue maxValue
#  1:        1       31

But I get an error when I do the same thing for a Date column:

colName = 'Date'
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# Error in FUN(X[[i]], ...) : 
#   only defined on a data frame with all numeric variables

If I use lapply(.SD, min) or sapply() then the dates are changed to numbers.

The following works and does not seem to waste memory and is fast. Is there anything better?

a <- dt[, colName, with=F]
setnames(a, 'a')
a[, .(minValue = min(a), maxValue = max(a))]

Solution

  • On your first attempt:

    dt[, .(minValue = min(colName), maxValue = max(colName)), with=F]
    # Error in `[.data.table`(dt, , .(minValue = min(colName), maxValue = max(colName)),  : 
    # could not find function "."
    

    You should simply read the Introduction to data.table vignette to understand what with= means. It's easier if you're aware of with() function from base R.

    On the second one:

    dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
    # Error in FUN(X[[i]], ...) : 
    #   only defined on a data frame with all numeric variables
    

    This seems like an issue with min() and max() on a data.frame/data.table with column with attributes. Here's a MRE.

    df = data.frame(x=as.Date("2015-01-01"))
    min(df)
    # Error in FUN(X[[i]], ...) : 
    #   only defined on a data frame with all numeric variables
    

    To answer your question, you can use get():

    dt[, .(min = min(get(colName)), max = max(get(colName)))]
    

    Or as @Frank suggested, [[ operator to subset the column:

    dt[, .(min = min(.SD[[colName]]), max = max(.SD[[colName]]))]
    

    There's not yet a nicer way of applying .SD to multiple functions (because base R doesn't seem to have one AFAICT, and data.table tries to use base R functions as much as possible). There's a FR #1063 to address this issue. If/when that gets implemented, then one could do, for example:

    # NOTE: not yet implemented, FR #1063
    dt[, colwise(.SD, min, max), .SDcols = colName]