
R: select a subset based on probability


I'm new to R. I have a sample drawn from a normal distribution:

n <- rnorm(1000, mean=10, sd=2)

As an exercise, I'd like to create a subset in which the probability of keeping each entry depends on its value. E.g., for values < 5 I'd like to keep a random 25% of the entries, for values > 15 I'd like to keep a random 75% of the entries, and for values between 5 and 15 I'd like to linearly interpolate the probability of selection between 25% and 75%. It seems like what I want is the "sample" function and its "prob" argument, but I'm not clear on the syntax.


Solution

  • For the first two subsets we may use

    # Tails: keep a fixed fraction of the entries, drawn at random without replacement
    idx1 <- n < 5
    ss1 <- n[idx1][sample(sum(idx1), floor(sum(idx1) * 0.25))]
    idx2 <- n > 15
    ss2 <- n[idx2][sample(sum(idx2), floor(sum(idx2) * 0.75))]
    

    while for the third one,

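    # Middle range: the keep-probability rises linearly from 0.25 at 5 to 0.75 at 15
    idx3 <- !idx1 & !idx2
    probs <- (n[idx3] - 5) / 10 * (0.75 - 0.25) + 0.25
    # One TRUE/FALSE draw per element, each with its own probability p
    ss3 <- n[idx3][sapply(probs, function(p) sample(c(TRUE, FALSE), 1, prob = c(p, 1 - p)))]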
    

    where probs are the linearly interpolated probabilities for each element of n[idx3]. Then, using sapply, we draw TRUE (take) or FALSE (don't take) for each of those elements. The complete subset is obtained by combining the three pieces, e.g. c(ss1, ss2, ss3).
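
As a possible alternative, the per-element draw can be vectorized and applied to the whole vector at once. The sketch below is not the method from the answer above: it clamps the values to [5, 15] so that the tails get flat probabilities of 0.25 and 0.75, and then makes one independent keep/drop draw per element with runif. As a result, the tails keep roughly 25% and 75% of their entries rather than an exact fraction. The helper name keep_prob, its parameters, and the fixed seed are assumptions made for illustration.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible
n <- rnorm(1000, mean = 10, sd = 2)

# Hypothetical helper: selection probability as a function of the value.
# Clamping x to [lo, hi] makes the probability flat at p_lo below lo
# and flat at p_hi above hi, with linear interpolation in between.
keep_prob <- function(x, lo = 5, hi = 15, p_lo = 0.25, p_hi = 0.75) {
  clamped <- pmin(pmax(x, lo), hi)
  p_lo + (clamped - lo) / (hi - lo) * (p_hi - p_lo)
}

probs <- keep_prob(n)

# One independent draw per element: keep it when a uniform draw on [0, 1)
# falls below that element's probability.
keep <- runif(length(n)) < probs
ss <- n[keep]
```

Here ss preserves the original order of n, and keep_prob(c(0, 10, 20)) gives 0.25, 0.5 and 0.75, which is a quick way to check the interpolation.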