Search code examples
rlistdata.tableweighted-average

R: data.table compute weighted means of multiple variables with multiple weight variables each, by group


I am still new to data.table. My question is similar to this one and this one. The difference is that I want to compute weighted means for multiple variables, by groups, but using more than one weight for each mean.

Consider the following data.table (the actual is much larger):

library(data.table)

set.seed(123456)

mydata <- data.table(CLID = rep("CNK", 10),
                     ITNUM = rep(c("First", "Second", "First", "First", "Second"), 2),
                     SATS = rep(c("Always", "Amost always", "Sometimes", "Never", "Always"), 2),
                     ASSETS = rep(c("0-10", "11-25", "26-100", "101-200", "MORE THAN 200"), 2),
                     AVGVALUE1 = rnorm(10, 10, 2),
                     AVGVALUE2 = rnorm(10, 10, 2),
                     WGT1 = rnorm(10, 3, 1),
                     WGT2 = rnorm(10, 3, 1),
                     WGT3 = rnorm(10, 3, 1))

#I set the key of the table to the variables I want to group by,
#so the output is sorted
setkeyv(mydata, c("CLID", "ITNUM", "SATS", "ASSETS"))

What I want to achieve is to compute the weighted means for AVGVALUE1 and AVGVALUE2 (and possibly more variables) by groups defined by ITNUM, SATS, ASSETS using each one of the weight variables WGT1, WGT2, WGT3 (and possibly more of them). So for each of the variables I want to compute the weighted means I will have three weighted means by groups (or whatever the number of weights is).

I can do it for each variable separately, e.g.:

all.weights <- c("WGT1", "WGT2", "WGT3")
avg.var <- "AVGVALUE1"
split.vars <- c("ITNUM", "SATS", "ASSETS")

mydata[ , Map(f = weighted.mean, x = .(get(avg.var)), w = mget(all.weights),
na.rm = TRUE), by = c(key(mydata)[1], split.vars)]

I add the first key variable in by, although it is a constant, because I would like to have it as a column in the output. And I get:

   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

However, with the actual data.table, where I have much more columns to compute the weighted means for (as well as much more weights to use), it would be rather cumberosme to do it one by one. What I imagine is a function where the mean for each of the variables (AVGVALUE1, AVGVALUE2 and so on) is computed with each of the weight variables (WGT1, WGT2, WGT3 and so on) and the output for each variable for which the weighted mean is computed is added to a list. I guess the list would be the best option because if all estimates are in the same output, the number of columns might be endless. So something like this:

[[1]]
   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

[[2]]
   CLID  ITNUM         SATS        ASSETS        V1        V2        V3
1:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
2:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
3:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
4:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
5:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928

What I tried so far:

  1. Using lapply

    all.weights <- c("WGT1", "WGT2", "WGT3")
    avg.vars <- c("AVGVALUE1", "AVGVALUE2")
    split.vars <- c("ITNUM", "SATS", "ASSETS")
    
    lapply(mydata, function(i) {
    mydata[ , Map(f = weighted.mean, x = mget(avg.vars)[i], w = mget(all.weights),
    na.rm = TRUE), by = c(key(mydata)[1], split.vars)]
    })
    
    Error in weighted.mean.default(x = dots[[1L]][[1L]], w = dots[[2L]][[1L]],  : 
     'x' and 'w' must have the same length
    
  2. Using mapply

    myfun <- function(data, spl.v, avg.v, wgts) {
      data[ , Map(f = weighted.mean, x = mget(avg.v), w = mget(all.weights),
      na.rm = TRUE), by = c(key(data)[1], spl.v)]
    }
    
    mapply(FUN = myfun, data = mydata, spl.v = split.vars, avg.v = avg.vars,
    wgts = all.weights)
    
    Error: value for ‘AVGVALUE2’ not found
    

I tried to wrap the mget(avg.v) as a list - .(mget(avg.v)), but then get another error:

 Error in mapply(FUN = f, ..., SIMPLIFY = FALSE) : 
  could not find function "." 

Can someone help?


Solution

  • I. lapply solution

    all.weights <- c("WGT1", "WGT2", "WGT3")
    avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
    split.vars  <- c("ITNUM", "SATS", "ASSETS")
    
    myfun <- function(avg.vars){
      tmp <-
        mydata[ , Map(f = weighted.mean, 
                    x = .(get(avg.vars)), 
                    w = mget(all.weights),
                    na.rm = TRUE), 
              by = c(key(mydata)[1], split.vars)]  
    
      return(tmp) # totally optional, a habit from using C and Java
    }
    
    lapply(avg.vars, myfun)
    

    Up-sides:

    • Makes use of an *apply
    • Avoids loop
    • Way faster than doing it one by one

    Down-sides:

    • Returns a list
    [[1]]
       CLID  ITNUM         SATS        ASSETS       V1       V2       V3
    1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
    2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
    3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
    4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
    5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458
    
    [[2]]
       CLID  ITNUM         SATS        ASSETS        V1        V2        V3
    1:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
    2:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
    3:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
    4:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
    5:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928
    

    II. for loop solution

    Using a simple for loop with the example where avg.vars has 2 values:

    all.weights <- c("WGT1", "WGT2", "WGT3")
    avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
    split.vars  <- c("ITNUM", "SATS", "ASSETS")
    
    result <- data.frame(matrix(nrow=0,ncol=7))
    for(i in avg.vars){
      tmp <- 
        mydata[ , Map(f = weighted.mean, 
                    x = .(get(i)), 
                    w = mget(all.weights),
                    na.rm = TRUE), 
              by = c(key(mydata)[1], split.vars)]  
    
      result <- rbind(result,tmp,use.names=F)
    }
    colnames(result) <- c("CLID", "ITNUM", "SATS", "ASSETS", "V1", "V2", "V3")
    result
    
        CLID  ITNUM         SATS        ASSETS        V1        V2        V3
     1:  CNK  First       Always          0-10 11.668243 11.668192 11.668287
     2:  CNK  First        Never       101-200 11.373780 12.210083 11.601819
     3:  CNK  First    Sometimes        26-100 12.430039 13.134499 12.013299
     4:  CNK Second       Always MORE THAN 200 12.322651 11.816135 12.567860
     5:  CNK Second Amost always         11-25 10.765557 11.346688 10.524583
     6:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
     7:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
     8:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
     9:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
    10:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928
    

    Up-sides:

    • Completes instantly in the example
    • Scales to any number of columns with no additional data manipulation/coding
    • Will save a huge amount of time over going one by one
    • Returns a nice data.table
    • if you actually want a list you can get that by initializing return as a list (return <- list()), creating a counter variable (n <- 1) then replacing the rbind statement with return[n] <- tmp and incrementing the counter (n <- n + 1) within the loop

    Down-sides:

    • If your data is very large (e.g. > 100,000 rows and dozens or more values of avg.var) then performance of any loop or function written with loops will be poor