Search code examples
rdplyrweighted-average

How to calculate weighted mean using mutate_at in R?


I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).

This is the list of columns I would like to estimate the weighted means of:

 vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")

This is the code I have been using:

 df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))

I have also tried:

 df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population)) 

As well as tested the following code on only 2 columns:

 df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population)) 

However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:

 Error in weighted.mean.default(., df$Population) : 
   'x' and 'w' must have the same length

I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!


Solution

  • There is a lot to unpack here...

    1. Probably you mean summarise instead of mutate, because with mutate you would just replicate your result for each row.
    2. mutate_at and summarise_at are subseeded and you should use across instead.
    3. the reason why your code wasn't working was because you did not write your function as a formula (you did not add ~ at the beginning), also you were using df$Population instead of Population. When you write Population, summarise knows you're talking about the column Population which, at that point, is grouped like the rest of the dataframe. When you use df$Population you are calling the column of the original dataframe without grouping. Not only it is wrong, but you would also get an error because the length of the variable you are trying to average and the lengths of the weights provided by df$Population would not correspond.

    Here is how you could do it:

    library(dplyr)
    
    df %>%
       group_by(cz) %>% 
       summarise(across(vlist, weighted.mean, Population),
                 .groups = "drop")
    

    If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:

    df %>%
       group_by(cz) %>% 
       summarise_at(vlist, ~weighted.mean(., Population)) %>%
       ungroup()
    

    I considered df and vlist like the following:

    vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
    df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
    names(df) <- vlist
    df$cz <- rep(letters[1:10], each = 10)
    df$Population <- runif(100)