Search code examples
rdata.tablelapplysummarizationgroup-summaries

data.table: Using with=False and transforming function/summary function?


I want to summarise several variables in data.table, output in wide format, output possibly as a list per variable. Since several other approaches did not work, I tried to do an outer lapply, giving the names of the variables as character vectors. I wanted to pass these in, using with=FALSE.

carsx=as.data.table(cars)
lapply( list(speed="speed",dist= "dist"), #error object 'ansvals' not found
    function(x)  carsx[,list(mean(x), min(x), max(x) ), with=FALSE ] ) 

Since this does not work, I tried the more simple approach without lapply.

carsx[,list(mean("speed"), min("speed"), max("speed") ), with=FALSE ] #error object 'ansvals' not found

This does not work either. Is there any way to do something like this? Is this behaviour of 'with' wanted? (I am aware that ?data.table mentions with only to select columns, but in my case it would be useful to be able to transform them as well)

When with=FALSE, j is a vector of names or positions to select, similar to a data.frame. with=FALSE is often useful in data.table to select columns dynamically.

EDIT My aim is to get a summary per group in wide format, for different variables. I tried to extend the following, which works only for one variable, for a list of variables.

carsx[,list(mean(speed), min(speed), max(speed) ) ,by=(dist>50)

Lamentably SO doesnt let me post my other question. There I described that I want an output similiar to:

lapply( list(speed="speed",dist= "dist"),
        function(x) do.call("as.data.frame", aggregate(cars[,x], list(class=cars$dist>50), FUN=summary) ) )

Expected Output would be something like:

$speed 
         V1       V2 V3
1: FALSE 12.96970  4 20
2:  TRUE 20.11765 14 25

$dist
         V1       V2 V3
1: FALSE 12.96970  4 20
2:  TRUE 20.11765 14 25

Solution

  • Building on Svens answer a combination of .SDcols, rbindlist, and outer and inner lapply did the trick. The inner lapply is necessary to access .SD.

    lapply( list(speed="speed",dist= "dist"),
        function(x)  carsx[ , rbindlist(lapply(.SD, function(x) list(mean=mean(x), min=min(x), max=max(x)) )), 
                           .SDcols = x,by= (dist>50)] ) 
    

    Result:

    $speed
        dist     mean min max
    1: FALSE 12.96970   4  20
    2:  TRUE 20.11765  14  25
    
    $dist
        dist     mean min max
    1: FALSE 27.84848   2  50
    2:  TRUE 72.35294  52 120