I'm new to R. I have a sample drawn from a normal distribution:
n <- rnorm(1000, mean=10, sd=2)
As an exercise, I'd like to create a subset based on a probability curve derived from the values. E.g., for values < 5 I'd like to keep a random 25% of the entries, for values > 15 I'd like to keep a random 75% of the entries, and for values between 5 and 15 I'd like to linearly interpolate the probability of selection between 25% and 75%. It seems like what I want is the "sample" command and its "prob" option, but I'm not clear on the syntax.
For the first two subsets we may use
idx1 <- n < 5
ss1 <- n[idx1][sample(sum(idx1), sum(idx1) * 0.25)]
idx2 <- n > 15
ss2 <- n[idx2][sample(sum(idx2), sum(idx2) * 0.75)]
while for the third one,
idx3 <- !idx1 & !idx2
probs <- (n[idx3] - 5) / 10 * (0.75 - 0.25) + 0.25
ss3 <- n[idx3][sapply(probs, function(p) sample(c(TRUE, FALSE), 1, prob = c(p, 1 - p)))]
where probs holds the linearly interpolated selection probability for each element of n[idx3]. Then, using sapply, we draw TRUE (take) or FALSE (don't take) for each of those elements.
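If independent draws are acceptable for all three ranges, the same idea can be written without sapply. This is only a sketch of one possible variant, not the approach above: the names p, keep and ss are made up for illustration, and the seed is fixed only to make the run reproducible. It clamps the interpolated probability to [0.25, 0.75] and compares it with uniform random numbers.

```r
set.seed(42)  # assumption: fixed seed, only for reproducibility
n <- rnorm(1000, mean = 10, sd = 2)

# Selection probability for every element of n:
# 0.25 at 5, 0.75 at 15, linear in between,
# clamped to [0.25, 0.75] outside that range.
p <- pmin(pmax((n - 5) / 10 * (0.75 - 0.25) + 0.25, 0.25), 0.75)

# One independent TRUE/FALSE draw per element:
# keep[i] is TRUE with probability p[i].
keep <- runif(length(n)) < p
ss <- n[keep]
```

Note the difference in the tails: the code above keeps a fixed share of the entries below 5 and above 15, while this variant keeps each such entry independently with probability 0.25 or 0.75, so the kept share matches those figures only on average.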