I have a dataset with multiple variables that I want to collapse/summarize (depending on whether your background is more Stata or tidyverse) using data.table
without having to name each individual variable in the code that does this.
Here's some example data:
library(data.table)
dt <- data.table(v1 = c(1,2,5,8,5,9, NA),
v2 = c(5,3,6,1, NA,7,8),
year = c(1,1,2,3,3,3,4))
dt
v1 v2 year
1: 1 5 1
2: 2 3 1
3: 5 6 2
4: 8 1 3
5: 5 NA 3
6: 9 7 3
7: NA 8 4
And here's the final dataset I want:
# this is the final data we want:
dt[, .(newv1 = sum(v1, na.rm = TRUE),
newv2 = sum(v2, na.rm = TRUE)),
by = 'year']
year newv1 newv2
1: 1 3 8
2: 2 5 6
3: 3 22 8
4: 4 0 8
The actual dataset has many variables I'd like to summarize, so I want to name the variables in a list outside of the collapsing beforehand and then name them in a systematic way, such as:
# but we want to do it with objects e.g.:
vars.to.collapse <- c('v1', 'v2')
new.v.names <- paste0('new', vars.to.collapse)
new.v.names
[1] "newv1" "newv2"
I know you can do this kind of thing--that is, create multiple variables using objects--using .SD
when you're adding/modifying variables using the :=
operator, but I have not been able to find a way to do this while altering the unit of observation. In an ideal world I'd use code something like the following:
# want something like this but it doesn't work:
newdt <- dt[, .( (new.v.names = sum(.SD, na.rm = TRUE))),
.SDcols = vars.to.collapse,
by = 'year']
# not what I want:
newdt
year V1
1: 1 11
2: 2 11
3: 3 30
4: 4 8
But this doesn't produce the dataset I showed you above with the new variable names and all. I could work around this by using the :=
operator to add these statistics to the dataset and then drop duplicates or something, but I'd prefer to summarize directly if possible.
To follow up on your comment about naming new variables, I suggest two solutions using the data.table
library.
So, please find below the two reprex.
Reprex 1 (as a follow-up to the solution proposed by @Roland)
library(data.table)
vars.to.collapse <- c('v1', 'v2') # your code
new.v.names <- paste0('new', vars.to.collapse) # your code
dt <- dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = vars.to.collapse, by = 'year'] # Roland's code
setnames(dt, c("year", new.v.names))
dt
#> year newv1 newv2
#> 1: 1 3 8
#> 2: 2 5 6
#> 3: 3 22 8
#> 4: 4 0 8
Reprex 2
library(data.table)
vars.to.collapse <- c('v1', 'v2') # your code
new.v.names <- paste0('new', vars.to.collapse) # your code
dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = vars.to.collapse, by = 'year'
][, (new.v.names) := .SD, .SDcols = vars.to.collapse
][, .SD, .SDcols = !patterns("^v")][]
#> year newv1 newv2
#> 1: 1 3 8
#> 2: 2 5 6
#> 3: 3 22 8
#> 4: 4 0 8
Created on 2021-11-17 by the reprex package (v2.0.1)