How to sample randomly, with replacement (i.e. bootstrap) in R when one observation can have more than one row?


My data looks like this (this is a simplified example):

Var1 IDvar
123 1
456 2
789 2
987 3

I would like to draw a random sample of four observations, with replacement, based on IDvar. The obvious approach is to sample four values directly from the IDvar column, i.e.:

sample(df$IDvar, replace = TRUE)

But then sometimes only one of the two rows with IDvar = 2 ends up in the sample. On the other hand, if only unique IDs are sampled, the given sample size can be exceeded, i.e.:

sample(unique(df$IDvar), size = 4, replace = TRUE)
[1] 3 1 1 2

...this is not allowed if the given sample size is four: we now have five observations, since IDvar = 2 corresponds to two rows.

So, is there a way to perform this type of random sampling with replacement?

One thing that came to mind was to sample IDs one by one and, after each draw, check whether there is still enough "sample size" left for that ID, but is this efficient at all?


Solution

  • I implemented the following: for every group of rows sharing an ID, first sample it down to one representative. Then collect the results and run a plain sample() without replacement to put them in an arbitrary order.

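    df <- tibble::tribble(
      ~Var1, ~IDvar,
       123L,     1L,
       456L,     2L,
       789L,     2L,
       987L,     3L,
       112L,     3L,
       123L,     3L
    )

    do_a_sample <- function(df) {
      # group the Var1 values by ID
      parts <- split(df$Var1, df$IDvar)
      # pick one representative per ID; indexing with sample.int() avoids
      # sample(x, 1) drawing from 1:x when a group holds a single number
      reps <- sapply(parts, \(x) x[sample.int(length(x), 1)])
      # shuffle the representatives (the names are the IDs)
      reps[sample.int(length(reps))]
    }
    # run this as many times as you need
    do_a_sample(df)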
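
If the total number of rows has to match the requested sample size exactly, the one-by-one idea from the question can be sketched roughly as below. This is only a sketch under assumptions, not a standard bootstrap routine: sample_ids_to_size, n_rows and id_col are hypothetical names introduced here for illustration. It draws whole IDs with replacement and only considers IDs whose row count still fits the remaining budget, so every drawn ID contributes all of its rows.

```r
# Hypothetical helper: resample whole IDs with replacement until exactly
# n_rows rows have been collected.
sample_ids_to_size <- function(df, n_rows, id_col = "IDvar") {
  rows_by_id <- split(seq_len(nrow(df)), df[[id_col]])  # row indices per ID
  sizes <- lengths(rows_by_id)                          # number of rows per ID
  picked <- character(0)
  remaining <- n_rows
  while (remaining > 0) {
    # IDs whose rows still fit into the remaining budget
    fits <- names(sizes)[sizes <= remaining]
    if (length(fits) == 0) stop("no ID fits the remaining sample size")
    # index-based draw, safe even when only one candidate is left
    id <- fits[sample.int(length(fits), 1)]
    picked <- c(picked, id)
    remaining <- remaining - sizes[[id]]
  }
  # return all rows of every picked ID, repeated as often as it was drawn
  df[unlist(rows_by_id[picked], use.names = FALSE), , drop = FALSE]
}

# toy data from the question
toy <- data.frame(Var1 = c(123, 456, 789, 987), IDvar = c(1, 2, 2, 3))
boot <- sample_ids_to_size(toy, n_rows = 4)
nrow(boot)  # 4 by construction
```

Note that restricting the last draws to IDs that still fit makes large groups somewhat less likely to appear than in a plain resampling of IDs, so whether this is acceptable depends on what the bootstrap is used for. The loop runs at most n_rows times, so efficiency should rarely be a concern.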