
Sample with replacement, but cap how many times each element can be drawn


Is it possible to extend the sample function in R to not return more than say 2 of the same element when replace = TRUE?

Suppose I have a vector:

l = c(1,1,2,3,4,5)

To sample 3 elements with replacement, I would do:

sample(l, 3, replace = TRUE)

Is there a way to constrain its output so that the same element is returned at most 2 times? So (1,1,2) or (1,3,3) is allowed, but (1,1,1) or (3,3,3) is excluded?


Solution

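  • The basic idea is to convert sampling with replacement into sampling without replacement: build a pool in which each unique value appears as many times as it is allowed to be drawn, then draw from that pool without replacement.

    set.seed(0)              ## for reproducibility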

    ll <- unique(l)          ## unique values
    #[1] 1 2 3 4 5
    pool <- rep.int(ll, 2)   ## replicate each unique so they each appear twice
    #[1] 1 2 3 4 5 1 2 3 4 5
    sample(pool, 3)          ## draw 3 samples without replacement
    #[1] 4 3 5
    
    ## replicate it a few times
    ## each column is one sample; `replicate` simplifies the results into a matrix
    replicate(5, sample(pool, 3))
    #     [,1] [,2] [,3] [,4] [,5]
    #[1,]    1    4    2    2    3
    #[2,]    4    5    1    2    5
    #[3,]    2    1    2    4    1
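
The steps above can be wrapped in a small helper. This is a minimal sketch, not part of base R: the name `sample_capped` and its `max_count` argument are hypothetical. It assumes the requested size does not exceed the pool size, since drawing without replacement would otherwise fail. Note that `unique()` discards the fact that 1 appears twice in `l`, so every distinct value gets the same number of copies in the pool.

```r
# Hypothetical helper (not in base R): draw `size` values from `x` so that
# the i-th unique value appears at most `max_count[i]` times.
sample_capped <- function(x, size, max_count = 2L) {
  values <- unique(x)
  # recycle a single cap so that every unique value gets one
  caps <- rep_len(as.integer(max_count), length(values))
  pool <- rep.int(values, caps)
  # sampling without replacement cannot return more than the pool holds
  stopifnot(size <= length(pool))
  # index via sample.int() so a length-one numeric pool is not read as 1:pool
  pool[sample.int(length(pool), size)]
}

l <- c(1, 1, 2, 3, 4, 5)
set.seed(0)
draws <- replicate(1000, sample_capped(l, 3))

# largest count of any single value within one draw; never exceeds 2
max(apply(draws, 2L, function(s) max(tabulate(match(s, unique(l))))))
```

Indexing with `sample.int()` instead of calling `sample(pool, size)` directly is a defensive choice: `sample()` treats a single number `n >= 1` as the range `1:n`.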
    

    If you want different values to be allowed a different maximum number of times, pass a vector of counts to rep.int, for example:

    pool <- rep.int(ll, c(2, 3, 3, 4, 1))
    #[1] 1 1 2 2 2 3 3 3 4 4 4 4 5
    
    ## draw 9 samples; replicate 5 times
    oo <- replicate(5, sample(pool, 9))
    #      [,1] [,2] [,3] [,4] [,5]
    # [1,]    5    1    4    3    2
    # [2,]    2    2    4    4    1
    # [3,]    4    4    1    1    1
    # [4,]    4    2    3    2    5
    # [5,]    1    4    2    5    2
    # [6,]    3    4    3    3    3
    # [7,]    1    4    2    2    2
    # [8,]    4    1    4    3    3
    # [9,]    3    3    2    2    4
    

    We can call tabulate on each column to count the frequency of 1, 2, 3, 4, 5:

    ## set `nbins` in `tabulate` so frequency table of each column has the same length
    apply(oo, 2L, tabulate, nbins = 5)
    #     [,1] [,2] [,3] [,4] [,5]
    #[1,]    2    2    1    1    2
    #[2,]    1    2    3    3    3
    #[3,]    2    1    2    3    2
    #[4,]    3    4    3    1    1
    #[5,]    1    0    0    1    1
    

    The counts in every column respect the frequency upper bounds c(2, 3, 3, 4, 1) we set.
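
Rather than comparing the tables by eye, the same check can be done in code. The sketch below rebuilds the pool from the caps and compares the frequency table against them. The variable names `caps` and `freq` are introduced here for illustration only.

```r
set.seed(0)
ll   <- unique(c(1, 1, 2, 3, 4, 5))   # 1 2 3 4 5
caps <- c(2, 3, 3, 4, 1)              # upper bound for each unique value
pool <- rep.int(ll, caps)

oo   <- replicate(5, sample(pool, 9)) # one sample per column
freq <- apply(oo, 2L, tabulate, nbins = 5)

# `caps` is recycled down each column of `freq`, so row i is compared with caps[i]
all(freq <= caps)                     # TRUE by construction
```

The comparison relies on `freq` having exactly `length(caps)` rows, which is why `nbins` is fixed in the `tabulate` call.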


    Would you explain the difference between rep and rep.int?

    rep.int is not the "integer" method for rep. It is just a faster, simplified version of rep with fewer features. You can find more details on rep, rep.int and rep_len on the doc page ?rep.
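
A short comparison may help. It uses only base R functions, and the vector `x` is made up for illustration.

```r
x <- 1:3

# for plain vectors, rep.int(x, times) gives the same result as rep(x, times = )
identical(rep.int(x, 2L), rep(x, times = 2L))   # TRUE

rep.int(x, c(2L, 1L, 3L))   # per-element counts: 1 1 2 3 3 3
rep(x, each = 2L)           # 1 1 2 2 3 3; rep.int() has no `each` argument
rep_len(x, 5L)              # recycle to a fixed length: 1 2 3 1 2
```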