This is a question both about using the boot() function with grouped variables, but also about passing multiple columns of data into boot. Almost all examples of the boot() function seem to pass a single column of data to calculate a simple bootstrap of the mean.
My specific analysis is trying to use the stats::weighted.mean(x,w) function which takes a vector 'x' of values to calculate the mean and a second vector 'w' for weights. The main point is that I need two inputs into this function - and I'm hoping the solution will generalize to any function that takes multiple arguments.
I'm also looking for a solution to use this weighted.means function in a dplyr style workflow with group_by() variables. If the answer is that "it can't be done with dplyr", that's fine, I'm just trying to figure it out.
Below I simulate a dataset with three groups (A,B,C) that each have different ranges of counts. I also attempt to come up with a function "my.function" that will be used to bootstrap the weighted average. Here might be my first mistake: is this how I would set up a function to pass in the 'count' and 'weight' columns of data into each bootstrapped sample? Is there some other way to index the data?
Inside the summarise() call, I reference the original data with "." - Possibly another mistake?
The end result shows that I was able to achieve appropriately grouped calculations using mean() and weighted.mean(), but the calls for confidence intervals using boot() have instead calculated the 95% confidence interval around the global mean of the dataset.
Suggestions on what I'm doing wrong? Why is the boot() function referencing the entire dataset and not the grouped subsets?
library(tidyverse)
library(boot)
set.seed(20)
sample.data = data.frame(letter = rep(c('A','B','C'),each = 50) %>% as.factor(),
counts = c(runif(50,10,30), runif(50,40,60), runif(50,60,100)),
weights = sample(10,150, replace = TRUE))
##Define function to bootstrap
##I'm using stats::weighted.mean() which needs to take in two arguments
##############
my.function = function(data,index){
d = data[index,] #create bootstrap sample of all columns of original data?
return(weighted.mean(d$counts, d$weights)) #calculate weighted mean using 'counts' and 'weights' columns
}
##############
## group by 'letter' and calculate weighted mean, and upper/lower 95% CI limits
## I pass data to boot using "." thinking that this would only pass each grouped subset of data
##(e.g., only letter "A") to boot, but instead it seems to pass the entire dataset.
sample.data %>%
group_by(letter) %>%
summarise(avg = mean(counts),
wtd.avg = weighted.mean(counts, weights),
CI.LL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[4],
CI.UL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[5])
And below I've calculated a rough estimate of 95% confidence intervals around the global mean to show that this is what was going on with boot() in my summarise() call above
#Here is a rough 95% confidence interval estimate as +/- 1.96* Standard Error
mean(sample.data$counts) + c(-1,1) * 1.96 * sd(sample.data$counts)/sqrt(length(sample.data[,1]))
The following base R solution solves the problem of bootstrapping by groups. Note that boot::boot
is only called once.
library(boot)
sp <- split(sample.data, sample.data$letter)
y <- lapply(sp, function(x){
wtd.avg <- weighted.mean(x$counts, x$weights)
basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
CI.LL <- basic[4]
CI.UL <- basic[5]
data.frame(wtd.avg, CI.LL, CI.UL)
})
do.call(rbind, y)
# wtd.avg CI.LL CI.UL
#A 19.49044 17.77139 21.16161
#B 50.49048 48.79029 52.55376
#C 82.36993 78.80352 87.51872
Final clean-up:
rm(sp)
A dplyr
solution could be the following. It also calls map_dfr
from package purrr
.
library(boot)
library(dplyr)
sample.data %>%
group_split(letter) %>%
purrr::map_dfr(
function(x){
wtd.avg <- weighted.mean(x$counts, x$weights)
basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
CI.LL <- basic[4]
CI.UL <- basic[5]
data.frame(wtd.avg, CI.LL, CI.UL)
}
)
# wtd.avg CI.LL CI.UL
#1 19.49044 17.77139 21.16161
#2 50.49048 48.79029 52.55376
#3 82.36993 78.80352 87.51872