r sample replicate statistics-bootstrap

How to bootstrap a function after taking a randomly drawn sample without replacement


I have some code that allows me to take two randomly drawn samples from a dataset, apply a function and repeat the procedure a certain number of times (see below code from associated question: How to bootstrap a function with replacement and return the output).

Example data:

> dput(a)
structure(list(index = 1:30, val = c(14L, 22L, 1L, 25L, 3L, 34L, 
35L, 36L, 24L, 35L, 33L, 31L, 30L, 30L, 29L, 28L, 26L, 12L, 41L, 
36L, 32L, 37L, 56L, 34L, 23L, 24L, 28L, 22L, 10L, 19L), id = c(1L, 
2L, 2L, 3L, 3L, 4L, 5L, 6L, 7L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 
14L, 15L, 16L, 16L, 17L, 18L, 19L, 20L, 21L, 21L, 22L, 23L, 24L, 
25L)), .Names = c("index", "val", "id"), class = "data.frame", row.names = c(NA, 
-30L))

Code:

    library(plyr)
    extractDiff <- function(P){
      subA <- P[sample(nrow(P), 15, replace=TRUE), ] # takes a random sample of 15 rows
      subB <- P[sample(nrow(P), 15, replace=TRUE), ] # takes a second random sample of 15 rows
      meanA <- mean(subA$val)
      meanB <- mean(subB$val)
      diff <- abs(meanA-meanB)
      outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
      return(outdf)
    }

    set.seed(42)
    fin <- do.call(rbind, replicate(10, extractDiff(a), simplify=FALSE))

Rather than taking TWO randomly drawn samples of size 15, I would like to take one randomly drawn sample of size 15, then extract the remaining 15 rows in the dataset after the first random draw has been taken (i.e. subA would equal the first randomly drawn sample of 15 obs, subB would equal the remaining 15 obs after subA has been taken). I am really not sure how to go about doing this. Any help would be really appreciated. Thanks!


Solution

  • I believe you can do this by making a small change to your code, like so:

    extractDiff <- function(P){
      sampleset <- sample(nrow(P), 15, replace=FALSE) # draws 15 distinct row indices at random, note replace=FALSE
      subA <- P[sampleset, ] # takes the 15 selected rows
      subB <- P[-sampleset, ] # takes the remaining rows in the set
      meanA <- mean(subA$val)
      meanB <- mean(subB$val)
      diff <- abs(meanA-meanB)
      outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
      return(outdf)
    }
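
    As a quick sanity check, the function can be run through the same replicate() call used in the question. This is only a sketch: it rebuilds the index and val columns of the example data (id is not needed here) and uses a compact version of the function above. Because subA and subB together cover every row exactly once and are the same size, the two means should always average to the overall mean of val.

    ```r
    # Example data from the question (id column omitted, it is not used)
    a <- data.frame(
      index = 1:30,
      val = c(14, 22, 1, 25, 3, 34, 35, 36, 24, 35, 33, 31, 30, 30, 29,
              28, 26, 12, 41, 36, 32, 37, 56, 34, 23, 24, 28, 22, 10, 19)
    )

    # Compact version of the split-sample function from the answer
    extractDiff <- function(P) {
      sampleset <- sample(nrow(P), 15, replace = FALSE)  # 15 distinct rows
      meanA <- mean(P$val[sampleset])                    # mean of the drawn rows
      meanB <- mean(P$val[-sampleset])                   # mean of the remaining rows
      c(mA = meanA, mB = meanB, diffAB = abs(meanA - meanB))
    }

    set.seed(42)
    fin <- do.call(rbind, replicate(10, extractDiff(a), simplify = FALSE))

    # Each row of fin is one random split; check that the two halves
    # always average to the overall mean of val
    isTRUE(all.equal(unname((fin[, "mA"] + fin[, "mB"]) / 2),
                     rep(mean(a$val), 10)))
    ```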
    

    However, please note that splitting the data this way (sampling without replacement) is not bootstrapping, since bootstrapping requires sampling with replacement. If, on the other hand, you want to sample with replacement from the dataset, and then sample with replacement from the rows not selected in the first sampling, you could do the following.

    extractDiff <- function(P){
      sampleset1 <- sample(nrow(P), 15, replace=TRUE) # draws 15 row indices, note replace=TRUE
      sampleset2 <- sample(setdiff(seq_len(nrow(P)), sampleset1), 15, replace=TRUE) # draws only from rows not used in sampleset1
      subA <- P[sampleset1, ] # takes the 15 rows drawn first
      subB <- P[sampleset2, ] # takes the 15 rows drawn from the remaining set
      meanA <- mean(subA$val)
      meanB <- mean(subB$val)
      diff <- abs(meanA-meanB)
      outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
      return(outdf)
    }
    

    However, this still may not be ideal depending on your application, as the second sample is drawn from a smaller pool of rows and is therefore more likely to contain multiple instances of a value than the first. If you were selecting a smaller proportion of the total set, this would be much less of a problem. You may be better off dividing the set into two halves (for example by shuffling the rows) and sampling with replacement from each half so the two sets are more even, but then the first set is again no longer a true bootstrap sample.
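
    The split-then-resample idea could be sketched roughly as follows. This is not code from the original answer: extractDiffHalves is a hypothetical name, the shuffle is done with base R's sample() rather than any particular 'shuffle' function, and the data frame rebuilds only the index and val columns from the question.

    ```r
    # Example data from the question (id column omitted, it is not used)
    a <- data.frame(
      index = 1:30,
      val = c(14, 22, 1, 25, 3, 34, 35, 36, 24, 35, 33, 31, 30, 30, 29,
              28, 26, 12, 41, 36, 32, 37, 56, 34, 23, 24, 28, 22, 10, 19)
    )

    # Hypothetical helper: split the rows into two random halves, then
    # resample n rows with replacement from within each half
    extractDiffHalves <- function(P, n = 15) {
      stopifnot(nrow(P) >= 2)
      shuffled <- sample(nrow(P))         # random permutation of row indices
      half <- floor(nrow(P) / 2)
      poolA <- shuffled[seq_len(half)]    # first half of the shuffled rows
      poolB <- shuffled[-seq_len(half)]   # remaining rows
      # Index into each pool via sample.int(), because sample(x, ...)
      # treats a length-one numeric x as 1:x
      idxA <- poolA[sample.int(length(poolA), n, replace = TRUE)]
      idxB <- poolB[sample.int(length(poolB), n, replace = TRUE)]
      meanA <- mean(P$val[idxA])
      meanB <- mean(P$val[idxB])
      c(mA = meanA, mB = meanB, diffAB = abs(meanA - meanB))
    }

    set.seed(42)
    fin2 <- do.call(rbind, replicate(10, extractDiffHalves(a), simplify = FALSE))
    ```

    Both samples are drawn from pools of equal size here, so neither is more prone to duplicates than the other, at the cost that neither is a bootstrap sample of the full dataset.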