Search code examples
rdata.tablesummarize

Is there a way in data.table to collapse/summarize variables using objects, a la .SDcols?


I have a dataset with multiple variables that I want to collapse/summarize (depending on whether your background is more Stata or tidyverse) using data.table without having to name each individual variable in the code that does this.

Here's some example data:

library(data.table)

dt <- data.table(v1 = c(1,2,5,8,5,9, NA),
                 v2 = c(5,3,6,1, NA,7,8),
                 year = c(1,1,2,3,3,3,4))

dt
    v1 v2 year
1:  1  5    1
2:  2  3    1
3:  5  6    2
4:  8  1    3
5:  5 NA    3
6:  9  7    3
7: NA  8    4

And here's the final dataset I want:

# this is the final data we want:
dt[, .(newv1 = sum(v1, na.rm = TRUE),
       newv2 = sum(v2, na.rm = TRUE)),
   by = 'year']

    year newv1 newv2
1:    1     3     8
2:    2     5     6
3:    3    22     8
4:    4     0     8

The actual dataset has many variables I'd like to summarize, so I want to name the variables in a list outside of the collapsing beforehand and then name them in a systematic way, such as:

# but we want to do it with objects e.g.:
vars.to.collapse <- c('v1', 'v2')
new.v.names <- paste0('new', vars.to.collapse)

new.v.names
[1] "newv1" "newv2"

I know you can do this kind of thing--that is, create multiple variables using objects--using .SD when you're adding/modifying variables using the := operator, but I have not been able to find a way to do this while altering the unit of observation. In an ideal world I'd use code something like the following:

# want something like this but it doesn't work:
newdt <- dt[, .( (new.v.names = sum(.SD, na.rm = TRUE))),
              .SDcols = vars.to.collapse,
              by = 'year']

# not what I want:
newdt
     year V1
1:    1 11
2:    2 11
3:    3 30
4:    4  8

But this doesn't produce the dataset I showed you above with the new variable names and all. I could work around this by using the := operator to add these statistics to the dataset and then drop duplicates or something, but I'd prefer to summarize directly if possible.


Solution

  • To follow up on your comment about naming new variables, I suggest two solutions using the data.table library.

    So, please find below the two reprex.

    Reprex 1 (as a follow-up to the solution proposed by @Roland)

    • Code
    library(data.table)
    
    vars.to.collapse <- c('v1', 'v2') # your code
    new.v.names <- paste0('new', vars.to.collapse) # your code
    
    dt <- dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = vars.to.collapse, by = 'year'] # Roland's code
    
    setnames(dt, c("year", new.v.names))
    
    • Output
    dt
    #>     year newv1 newv2
    #> 1:     1     3     8
    #> 2:     2     5     6
    #> 3:     3    22     8
    #> 4:     4     0     8
    

    Reprex 2

    • Code
    library(data.table)
    
    vars.to.collapse <- c('v1', 'v2')  # your code
    new.v.names <- paste0('new', vars.to.collapse) # your code
    
    dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = vars.to.collapse, by = 'year'
       ][, (new.v.names) := .SD, .SDcols = vars.to.collapse
         ][, .SD, .SDcols = !patterns("^v")][]
    
    • Output
    #>    year newv1 newv2
    #> 1:    1     3     8
    #> 2:    2     5     6
    #> 3:    3    22     8
    #> 4:    4     0     8
    

    Created on 2021-11-17 by the reprex package (v2.0.1)