I am still new to data.table
. My question is similar to this one and this one. The difference is that I want to compute weighted means for multiple variables, by groups, but using more than one weight for each mean.
Consider the following data.table
(the actual is much larger):
library(data.table)
set.seed(123456)
mydata <- data.table(CLID = rep("CNK", 10),
ITNUM = rep(c("First", "Second", "First", "First", "Second"), 2),
SATS = rep(c("Always", "Amost always", "Sometimes", "Never", "Always"), 2),
ASSETS = rep(c("0-10", "11-25", "26-100", "101-200", "MORE THAN 200"), 2),
AVGVALUE1 = rnorm(10, 10, 2),
AVGVALUE2 = rnorm(10, 10, 2),
WGT1 = rnorm(10, 3, 1),
WGT2 = rnorm(10, 3, 1),
WGT3 = rnorm(10, 3, 1))
#I set the key of the table to the variables I want to group by,
#so the output is sorted
setkeyv(mydata, c("CLID", "ITNUM", "SATS", "ASSETS"))
What I want to achieve is to compute the weighted means for AVGVALUE1
and AVGVALUE2
(and possibly more variables) by groups defined by ITNUM
, SATS
, ASSETS
using each one of the weight variables WGT1
, WGT2
, WGT3
(and possibly more of them). So for each of the variables I want to compute the weighted means I will have three weighted means by groups (or whatever the number of weights is).
I can do it for each variable separately, e.g.:
all.weights <- c("WGT1", "WGT2", "WGT3")
avg.var <- "AVGVALUE1"
split.vars <- c("ITNUM", "SATS", "ASSETS")
mydata[ , Map(f = weighted.mean, x = .(get(avg.var)), w = mget(all.weights),
na.rm = TRUE), by = c(key(mydata)[1], split.vars)]
I add the first key variable in by
, although it is a constant, because I would like to have it as a column in the output. And I get:
CLID ITNUM SATS ASSETS V1 V2 V3
1: CNK First Always 0-10 11.66824 11.66819 11.66829
2: CNK First Never 101-200 11.37378 12.21008 11.60182
3: CNK First Sometimes 26-100 12.43004 13.13450 12.01330
4: CNK Second Always MORE THAN 200 12.32265 11.81613 12.56786
5: CNK Second Amost always 11-25 10.76556 11.34669 10.52458
However, with the actual data.table
, where I have much more columns to compute the weighted means for (as well as much more weights to use), it would be rather cumberosme to do it one by one. What I imagine is a function where the mean for each of the variables (AVGVALUE1
, AVGVALUE2
and so on) is computed with each of the weight variables (WGT1
, WGT2
, WGT3
and so on) and the output for each variable for which the weighted mean is computed is added to a list. I guess the list would be the best option because if all estimates are in the same output, the number of columns might be endless. So something like this:
[[1]]
CLID ITNUM SATS ASSETS V1 V2 V3
1: CNK First Always 0-10 11.66824 11.66819 11.66829
2: CNK First Never 101-200 11.37378 12.21008 11.60182
3: CNK First Sometimes 26-100 12.43004 13.13450 12.01330
4: CNK Second Always MORE THAN 200 12.32265 11.81613 12.56786
5: CNK Second Amost always 11-25 10.76556 11.34669 10.52458
[[2]]
CLID ITNUM SATS ASSETS V1 V2 V3
1: CNK First Always 0-10 9.132899 9.060045 9.197005
2: CNK First Never 101-200 12.896584 13.278680 13.000772
3: CNK First Sometimes 26-100 10.972260 11.215390 10.828431
4: CNK Second Always MORE THAN 200 11.704404 11.611072 11.749586
5: CNK Second Amost always 11-25 8.086409 8.225030 8.028928
Using lapply
all.weights <- c("WGT1", "WGT2", "WGT3")
avg.vars <- c("AVGVALUE1", "AVGVALUE2")
split.vars <- c("ITNUM", "SATS", "ASSETS")
lapply(mydata, function(i) {
mydata[ , Map(f = weighted.mean, x = mget(avg.vars)[i], w = mget(all.weights),
na.rm = TRUE), by = c(key(mydata)[1], split.vars)]
})
Error in weighted.mean.default(x = dots[[1L]][[1L]], w = dots[[2L]][[1L]], :
'x' and 'w' must have the same length
Using mapply
myfun <- function(data, spl.v, avg.v, wgts) {
data[ , Map(f = weighted.mean, x = mget(avg.v), w = mget(all.weights),
na.rm = TRUE), by = c(key(data)[1], spl.v)]
}
mapply(FUN = myfun, data = mydata, spl.v = split.vars, avg.v = avg.vars,
wgts = all.weights)
Error: value for ‘AVGVALUE2’ not found
I tried to wrap the mget(avg.v)
as a list - .(mget(avg.v))
, but then get another error:
Error in mapply(FUN = f, ..., SIMPLIFY = FALSE) :
could not find function "."
Can someone help?
I. lapply
solution
all.weights <- c("WGT1", "WGT2", "WGT3")
avg.vars <- c("AVGVALUE1", "AVGVALUE2")
split.vars <- c("ITNUM", "SATS", "ASSETS")
myfun <- function(avg.vars){
tmp <-
mydata[ , Map(f = weighted.mean,
x = .(get(avg.vars)),
w = mget(all.weights),
na.rm = TRUE),
by = c(key(mydata)[1], split.vars)]
return(tmp) # totally optional, a habit from using C and Java
}
lapply(avg.vars, myfun)
Up-sides:
Down-sides:
[[1]] CLID ITNUM SATS ASSETS V1 V2 V3 1: CNK First Always 0-10 11.66824 11.66819 11.66829 2: CNK First Never 101-200 11.37378 12.21008 11.60182 3: CNK First Sometimes 26-100 12.43004 13.13450 12.01330 4: CNK Second Always MORE THAN 200 12.32265 11.81613 12.56786 5: CNK Second Amost always 11-25 10.76556 11.34669 10.52458 [[2]] CLID ITNUM SATS ASSETS V1 V2 V3 1: CNK First Always 0-10 9.132899 9.060045 9.197005 2: CNK First Never 101-200 12.896584 13.278680 13.000772 3: CNK First Sometimes 26-100 10.972260 11.215390 10.828431 4: CNK Second Always MORE THAN 200 11.704404 11.611072 11.749586 5: CNK Second Amost always 11-25 8.086409 8.225030 8.028928
II. for
loop solution
Using a simple for
loop with the example where avg.vars
has 2 values:
all.weights <- c("WGT1", "WGT2", "WGT3")
avg.vars <- c("AVGVALUE1", "AVGVALUE2")
split.vars <- c("ITNUM", "SATS", "ASSETS")
result <- data.frame(matrix(nrow=0,ncol=7))
for(i in avg.vars){
tmp <-
mydata[ , Map(f = weighted.mean,
x = .(get(i)),
w = mget(all.weights),
na.rm = TRUE),
by = c(key(mydata)[1], split.vars)]
result <- rbind(result,tmp,use.names=F)
}
colnames(result) <- c("CLID", "ITNUM", "SATS", "ASSETS", "V1", "V2", "V3")
result
CLID ITNUM SATS ASSETS V1 V2 V3 1: CNK First Always 0-10 11.668243 11.668192 11.668287 2: CNK First Never 101-200 11.373780 12.210083 11.601819 3: CNK First Sometimes 26-100 12.430039 13.134499 12.013299 4: CNK Second Always MORE THAN 200 12.322651 11.816135 12.567860 5: CNK Second Amost always 11-25 10.765557 11.346688 10.524583 6: CNK First Always 0-10 9.132899 9.060045 9.197005 7: CNK First Never 101-200 12.896584 13.278680 13.000772 8: CNK First Sometimes 26-100 10.972260 11.215390 10.828431 9: CNK Second Always MORE THAN 200 11.704404 11.611072 11.749586 10: CNK Second Amost always 11-25 8.086409 8.225030 8.028928
Up-sides:
data.table
return
as a list (return <- list()
), creating a counter variable (n <- 1
) then replacing the rbind
statement with return[n] <- tmp
and incrementing the counter (n <- n + 1
) within the loopDown-sides:
avg.var
) then performance of any loop or function written with loops will be poor