Search code examples
rvectorsample

Is it possible to do vectorized sampling by base::sample function in r?


I tried to sample 25 samples by using lapply,

a = list(c(1:5),c(100:105),c(110:115),c(57:62),c(27:32))

lapply(a,function(x)sample(x,5))

is it possible to use base::sample to do the vectorized sampling?

i.e.

sample(c(5,5),a)

Solution

  • It is not possible using base::sample; however, this kind of vectorized sampling is possible by using runif.

    I don't have a good way to vectorize sampling without replacement for an arbitrary number of samples from each vector in x. But we can sample each element of each vector.

    Here's a function that vectorizes sampling over a list of vectors. It will return a single vector of samples:

    multisample <- function(x, n = lengths(x), replace = FALSE) {
      if (replace) {
        unlist(x)[rep.int(lengths(x), n)*runif(sum(n)) + 1 + rep.int(c(0, cumsum(lengths(x[-length(x)]))), n)]
      } else {
        unlist(x)[rank(runif(sum(n)) + rep.int(seq_along(x), n))]
      }
    }
    

    The equivalent function using lapply:

    multisample2 <- function(x, n = lengths(x), replace = FALSE) {
      unlist(lapply(seq_along(n), function(i) sample(x[[i]], n[i], replace)))
    }
    

    Example usage:

    x <- list(c(1:9), c(11:18), c(21:27), c(31:36), c(41:45))
    
    # sampling without replacement
    multisample(x)
    #>  [1]  9  3  5  8  7  2  1  4  6 18 11 17 12 16 14 13 15 22 26 25 21 27 24 23 36
    #> [26] 31 35 34 33 32 45 43 42 44 41
    multisample2(x)
    #>  [1]  3  6  7  9  2  1  8  4  5 17 16 11 15 14 13 12 18 23 22 26 21 27 24 25 33
    #> [26] 32 35 34 31 36 42 43 41 44 45
    
    # sampling with replacement
    n <- 7:3 # the number of samples from each vector
    multisample(x, n, 1)
    #>  [1]  9  8  5  9  3  5  3 12 18 12 17 12 16 26 26 24 26 27 33 33 35 32 44 44 43
    multisample2(x, n, 1)
    #>  [1]  9  8  3  7  8  7  8 15 14 15 16 18 14 27 27 21 27 27 33 36 33 34 45 44 41
    

    The vectorized version is considerably faster:

    x <- lapply(sample(10:15, 1e4, 1), seq)
    n <- sample(10, 1e4, 1)
    
    microbenchmark::microbenchmark(multisample = multisample(x),
                                   multisample2 = multisample2(x))
    #> Unit: milliseconds
    #>          expr       min       lq     mean   median      uq      max neval
    #>   multisample  9.116301 10.33815 11.05857 10.70595 11.2395  16.9397   100
    #>  multisample2 62.319401 67.38040 71.06072 69.72585 72.4703 127.0234   100
    microbenchmark::microbenchmark(multisample = multisample(x, n, 1),
                                   multisample2 = multisample2(x, n, 1))
    #> Unit: milliseconds
    #>          expr       min       lq      mean    median        uq        max neval
    #>   multisample  2.535401  2.93265  3.167103  3.130601  3.420651   4.254302   100
    #>  multisample2 56.220200 61.74615 65.638942 65.007451 67.325051 109.572501   100
    

    If a list of vectors is desired instead, the functions can be modified:

    multisample <- function(x, n = lengths(x), replace = FALSE) {
      i <- rep.int(seq_along(x), n)
      if (replace) {
        split(unlist(x)[rep.int(lengths(x), n)*runif(sum(n)) + 1 + rep.int(c(0, cumsum(lengths(x[-length(x)]))), n)], i)
      } else {
        split(unlist(x)[rank(runif(sum(lengths(x))) + i)], i)
      }
    }
    
    multisample2 <- function(x, n = lengths(x), replace = FALSE) {
      if (replace) {
        lapply(seq_along(n), function(i) sample(x[[i]], n[i], 1))
      } else {
        lapply(x, sample)
      }
    }
    

    The vectorized version is still much faster.