Search code examples
rplyrstatistics-bootstrap

Avoiding intermediate dlply step when starting with a dataframe and ending with a dataframe


I am using plyr to perform a bootstrapping function on subsets of a dataset.

Because the boot function creates a list object, I am currently using dlply to store the output of the function, then a ddply to get just the parts of the bootfunction that I want out

My example dataset is as follows:

dat = data.frame(x = rnorm(10, sd = 1),y = rnorm(10, sd = 1),z = rep(c("SppA", "SppB", "SppC", "SppD", "SppE"), 2),u = rep(c("SiteA", "SiteB"), 5))

the exact function isn't terribly important, but for the sake of reproducibility, here is the function I'm using:

boot_fun = function(x,i) {
  i = sample(1:24, 24, replace = TRUE)
  ts1 = mean(x[i,"x"])
  ts2 = sample(x[i,"y"])
  mean(ts1) - mean(ts2)
}

My plyr function is the following:

temp = dlply(dat, c("z", "u"), summarise, boot_object = boot(dat, boot_fun, R = 1000))

Since what I want out of the boot object is the mean and CI, I then perform the following plyr function:

temp2 = ldply(temp, summarise, mean = mean(boot$t), lowCI = quantile(boot$t, 0.025), highCI = quantile(boot$t, 0.975))

This works and accomplishes exactly what I want it to (although with an error about subsetting which doesn't seem to affect anything I care about), but I feel like there should be a way to skip the intermediate dlply step.

-edit- to clarify on what I'm trying to do if I didn't need to be splitting the groups

If I was manually splitting instead of using plyr, it would look something like the following:

temp = boot(dat[dat$z == "SppA" & dat$u == "SiteA",], boot_fun, R = 1000)
temp2$mean = mean(temp$t)
temp2$lowCI = quantile(temp$t, 0.025)
temp2$highCI = quantile(temp$t, 0.975)

If I didn't care about the groups at all and just wanted to do this to the whole group it would be something like

temp = boot(dat, boot_fun, R = 1000)
temp2$mean = mean(temp$t)
temp2$lowCI = quantile(temp$t, 0.025)
temp2$highCI = quantile(temp$t, 0.975)

Solution

  • Your example is not reproducible by me.

    When I do temp = boot(dat, boot_fun, R = 1000), I get a WARNING:

    ORDINARY NONPARAMETRIC BOOTSTRAP
    Call:
    boot(data = dat, statistic = boot_fun, R = 1000)
    Bootstrap Statistics :
    WARNING: All values of t1* are NA
    

    I think your current code is pretty efficient, but if you're looking for other possibilities, you could try tidyverse to 1) group_by the relevant columns, 2) nest the relevant data for bootstrapping, 3) run your bootstrap with the nested data, 4) isolate the statistics you desire, then 5) return to a normal data frame

    library(boot)
    library(tidyverse)
    dat1 <- dat %>%
              group_by(z,u) %>%
              nest() %>%
              mutate(data=map(data,~boot(.x, boot_fun, R=1000))) %>%
              mutate(data=map(data,~data.frame(mean=mean(.x$t), lowCI=quantile(.x$t, 0.025), highCI=quantile(.x$t,0.975)))) %>%
              unnest(data)