Conditional Random Sample in R


I am wondering what the best way to solve this is. Essentially, I want to generate 20 samples, each made up of four numbers (x1 to x4) that add up to 100, with the extra constraint that x1 + x2 > 20. I am struggling to get something that is fast and efficient. I realise that I could filter out the rows that don't meet these criteria, but that isn't efficient if I generate 10,000 rather than 20.

The code is as below:

n = 20
x1 = sample(0:100, n, replace = TRUE)
x2 = sample(0:100, n, replace = TRUE)
x3 = sample(0:100, n, replace = TRUE)
# flag rows that break either constraint:
# x1+x2+x3 > 100 (x4 would be negative) or x1+x2 <= 20
bad = (x1 + x2 + x3) > 100 | (x1 + x2) <= 20
while (any(bad)) {
   k = sum(bad)  # redraw only the flagged rows
   x1[bad] = sample(0:100, k, replace = TRUE)
   x2[bad] = sample(0:100, k, replace = TRUE)
   x3[bad] = sample(0:100, k, replace = TRUE)
   bad = (x1 + x2 + x3) > 100 | (x1 + x2) <= 20
}
x4 = 100 - x1 - x2 - x3

df <- data.frame(x1,x2,x3,x4)

Thanks in advance.


Solution

  • Here is an unbiased way to pick k numbers in the range 0:n which sum to n. It is based on the stars and bars encoding:

    #picks k random numbers in range 0:n which sum to n:
    
    pick <- function(k,n){
      m <- n + k - 1 #number of stars and bars
      bars <- sort(sample(1:m,k-1)) #positions of the bars
      c(bars,m+1)-c(0,bars)-1
    }
    

    This generates a single example, returning a vector. As @Guillaume Devailly observes in their answer, most of the samples will satisfy the additional constraint on the sum of the first 2 numbers, so you can just filter out those that don't.
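
    To make the encoding concrete, here is a small hand-decoded case. It is only an illustrative sketch: `decode_bars` is a hypothetical helper name (not part of the answer) that applies the same arithmetic as the last line of `pick`, and the bar positions are fixed by hand rather than sampled.

    ```r
    # decode_bars is a hypothetical helper; it mirrors the last line of pick()
    decode_bars <- function(bars, m) {
      c(bars, m + 1) - c(0, bars) - 1
    }

    k <- 3                # how many numbers we want
    n <- 5                # what they must sum to
    m <- n + k - 1        # 7 slots in total: 5 stars and 2 bars
    bars <- c(2, 6)       # slots 2 and 6 hold bars:  * | * * * | *
    x <- decode_bars(bars, m)
    x                     # 1 3 1 (stars before, between, and after the bars)
    ```

    Each choice of bar positions corresponds to exactly one way of splitting n into k non-negative parts, which is why sampling the positions uniformly gives an unbiased draw.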

    Note that if you want 4 numbers in the range 1:100 which sum to 100 you could just use 1 + pick(4,96).
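
    A quick check of that shift, as a sketch (the seed is arbitrary and `pick` is repeated from above only so the snippet runs on its own):

    ```r
    # pick() copied from the answer above
    pick <- function(k, n) {
      m <- n + k - 1                      # number of stars and bars
      bars <- sort(sample(1:m, k - 1))    # positions of the bars
      c(bars, m + 1) - c(0, bars) - 1
    }

    set.seed(7)                # arbitrary seed, only for reproducibility
    x <- 1 + pick(4, 96)       # 4 parts summing to 96, then add 1 to each part

    sum(x)                     # 96 + 4 * 1 = 100
    min(x)                     # never below 1, because pick() never goes below 0
    ```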

    To enforce the constraint on the first two numbers:

    pick.sample <- function(){
      while(TRUE){
        x <- pick(4,100)
        if(sum(x[1:2]) >20) return(x)
      }
    }
    

    Then

    df <- data.frame(t(replicate(10000,pick.sample())))
    

    will create a 10,000-row data frame in which each row is a sample that satisfies both constraints.
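
    As a sanity check on the rejection loop, the sketch below builds a smaller data frame, verifies both constraints, and counts how often a single `pick(4, 100)` draw is accepted. The column names, the seed, and the row count of 1,000 are assumptions chosen for illustration; the two functions are repeated from above so the snippet is self-contained.

    ```r
    # Functions copied from the answer above
    pick <- function(k, n) {
      m <- n + k - 1
      bars <- sort(sample(1:m, k - 1))
      c(bars, m + 1) - c(0, bars) - 1
    }
    pick.sample <- function() {
      while (TRUE) {
        x <- pick(4, 100)
        if (sum(x[1:2]) > 20) return(x)
      }
    }

    set.seed(123)                                        # arbitrary seed
    df <- data.frame(t(replicate(1000, pick.sample())))  # 1,000 rows keeps the check quick
    names(df) <- c("x1", "x2", "x3", "x4")               # assumed names, matching the question

    # Both constraints should hold on every row
    stopifnot(all(rowSums(df) == 100))
    stopifnot(all(df$x1 + df$x2 > 20))

    # Exact acceptance rate by counting: if x1 + x2 = s there are (s + 1) ways
    # to split s over (x1, x2) and (101 - s) ways to split 100 - s over (x3, x4);
    # choose(103, 3) is the total number of outcomes of pick(4, 100)
    s <- 0:20
    accept_rate <- 1 - sum((s + 1) * (101 - s)) / choose(103, 3)
    draws_per_row <- 1 / accept_rate
    ```

    Under these assumptions `accept_rate` comes out at about 0.885, so the loop needs roughly 1.13 draws per accepted row on average, which is why simple filtering stays cheap even for 10,000 rows.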