r probability sample

Generate weighted sample from weighted list


I have a list of historical frequencies of elements that have occurred together over time. These elements may have occurred (without repetition) in sequences of various order and length.

For example, this could be a list of historic sequences: abc, gabd, ace

My challenge is to draw a simulated sample of size n from a list of weighted probabilities. For instance, a has appeared in 90% of the historic sequences, b in 70%, and so on.

What is a simple way to generate a weighted sample of 3 elements? Eventually I will put this in a loop to simulate the sample hundreds of times and collect the results, but for now generating a single sample will point me in the right direction.

library(tibble)

historical_p <-
  tribble(
    ~element, ~p,
    'a', .9,
    'b', .7,
    'c', .5,
    'd', .1,
    'e', .1,
    'f', .1,
    'g', .1
  )

Solution

  • Use sample() with the prob argument to draw one sample of n values, without replacement, from the vector element with weights p:

    set.seed(369894129)
    
    element <- letters[1:7]
    p <- c(0.9, 0.7, 0.5, 0.1, 0.1, 0.1, 0.1) # weights
    n <- 3                                    # number of elements per sample
    
    sample(element, size = n, replace = FALSE, prob = p)
    #> [1] "a" "f" "b"
    

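    A fast, vectorized way to generate N samples at once (inspired by this answer): draw a uniform random key for every element, raise it to the power 1/p, and keep the n elements with the largest keys in each column.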

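    N <- 1e5 # number of samples
    system.time({
      # one column per sample: uniform keys raised to 1/weight
      s <- Rfast::colOrder(
        matrix(runif(N*length(p)), length(p))^(1/p), FALSE, TRUE
      )[1:n,] # order each column by decreasing key, keep the top n indices
      s[] <- element[s] # replace the indices with the element labels
    })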
    #>    user  system elapsed 
    #>    0.15    0.04    0.19
    

    View the first 10 samples.

    s[,1:10]
    #>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    #> [1,] "a"  "a"  "a"  "c"  "d"  "b"  "b"  "c"  "c"  "c"  
    #> [2,] "c"  "e"  "b"  "a"  "c"  "a"  "a"  "a"  "f"  "a"  
    #> [3,] "b"  "f"  "d"  "d"  "a"  "c"  "c"  "g"  "b"  "b"
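
For the "put this in a loop" part of the question, here is a minimal base-R sketch, not a definitive implementation. It repeats the single-sample call with replicate() and, for comparison, applies the same key-based idea as above using apply() and order() instead of Rfast. The seed, the value of N, and the helper name inclusion() are assumptions made for illustration only.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility (assumption)

element <- letters[1:7]
p <- c(0.9, 0.7, 0.5, 0.1, 0.1, 0.1, 0.1)  # relative weights
n <- 3       # number of elements per sample
N <- 2000    # number of simulated samples (illustrative choice)

# (a) Plain loop: each column of s_loop is one simulated sample.
s_loop <- replicate(N, sample(element, size = n, replace = FALSE, prob = p))

# (b) Key-based method in base R: one uniform draw per element and sample,
#     raised to 1 / weight; the n largest keys in a column form the sample.
keys <- matrix(runif(N * length(p)), nrow = length(p))^(1 / p)
idx <- apply(keys, 2, order, decreasing = TRUE)[seq_len(n), , drop = FALSE]
s_keys <- matrix(element[idx], nrow = n)

# Hypothetical helper: share of samples that contain each element.
inclusion <- function(s) {
  sapply(element, function(e) mean(colSums(s == e) > 0))
}

round(rbind(loop = inclusion(s_loop), keys = inclusion(s_keys)), 2)
```

The two rows of the resulting table should be close to each other. Note that sample() treats prob as relative weights, so the share of samples containing a need not equal the historical 0.9.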