
How to select a random subset of a numeric vector to have a specific variance?


I have a numeric vector of length 10000 whose variance is 0.90. I'd like to choose a random subset of this vector, of any length, whose variance is 0.85. Of course, I could sort the vector in ascending order and gradually remove elements from either end of the distribution until I reach the desired variance, but that would not be a random selection. I'd like to select the elements randomly.

Update: As G5W pointed out, selecting a subset to have a specific variance is not random. I'd like to know if there is a non-random sampling method to choose a subset with a specific variance.
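
For reference, the deterministic trimming described above could be sketched roughly as follows. This is only an illustration under assumptions: the simulated `vec`, the names `s`, `lo`, `hi` and `trimmed`, and the rule of always dropping whichever end lies farther from the current mean are not part of the original question.

```r
set.seed(123)
vec <- rnorm(10000, 0, sqrt(0.9024591))  # stand-in data with variance ~0.9

s  <- sort(vec)
lo <- 1          # index of the smallest value still kept
hi <- length(s)  # index of the largest value still kept

while (var(s[lo:hi]) > 0.85) {
  m <- mean(s[lo:hi])
  # Drop the end that is farther from the current mean;
  # removing the most extreme value always lowers the variance.
  if (m - s[lo] > s[hi] - m) lo <- lo + 1 else hi <- hi - 1
}

trimmed <- s[lo:hi]
var(trimmed)     # at or just below 0.85
length(trimmed)  # number of values kept
```

The result is fully determined by the data, which is why it does not count as a random selection.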


Solution

  • We could use an iterative method to achieve this in a (sort of) random way.

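    Let's take a starting vector with 10000 elements and a variance of 0.9 (the standard deviation passed to rnorm is chosen so that, with this seed, the sample variance prints as 0.9):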

    set.seed(123)
    vec <- rnorm(10000, 0, sqrt(0.9024591))
    var(vec)
    #> [1] 0.9
    

    Now, to randomly subset the vector so that it has a variance of 0.85, we can pick an element at random and check whether the variance falls when it is removed. If it does, we drop the element; if not, we keep it and sample again. We repeat this until the variance drops to 0.85 or below:

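    v <- vec
    
    # Keep drawing candidates until the variance is at or below the target
    while(var(v) > 0.85)
    {
      var_v <- var(v)
      i <- sample(length(v), 1)            # index of a random candidate
      if(var(v[-i]) < var_v) v <- v[-i]    # drop it only if the variance falls
    }
    
    var(v)       # at or just below 0.85
    
    length(v)    # number of elements retained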
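
    The acceptance test in the loop can also be stated in closed form: dropping element x_i from a sample of size n lowers the sample variance exactly when (x_i - mean)^2 > (n - 1) / n * var. The sketch below checks that rule against brute-force recomputation on a few indices. The names `var_drop`, `idx`, `brute` and `lowers` are illustrative only.

    ```r
    set.seed(123)
    v <- rnorm(10000, 0, sqrt(0.9024591))
    n <- length(v)

    # Variance after dropping each element, without recomputing var() n times
    var_drop <- ((n - 1) * var(v) - n / (n - 1) * (v - mean(v))^2) / (n - 2)

    # Compare with direct recomputation on a few arbitrary indices
    idx   <- c(1, 50, 9999)
    brute <- sapply(idx, function(i) var(v[-i]))
    all.equal(var_drop[idx], brute)

    # Elements whose removal would lower the variance
    lowers <- (v - mean(v))^2 > (n - 1) / n * var(v)
    mean(lowers)  # share of elements the loop would accept for removal
    ```

    Only values lying roughly more than one standard deviation from the mean are ever removed, so the loop thins out the tails at random rather than trimming them in order.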
    

    We could get the final variance closer to 0.85 by backing up one step once it falls below the threshold and instead removing whichever single value brings the variance closest to 0.85. It comes down to whether randomness or closeness to 0.85 is your priority.
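
    A minimal sketch of that refinement, assuming the same simulated data as above; `target`, `prev`, `cand`, `best` and `v2` are names introduced here for illustration. It remembers the vector as it was before the last random removal, then tries every possible single removal from that state and keeps the one landing closest to the target.

    ```r
    set.seed(123)
    vec    <- rnorm(10000, 0, sqrt(0.9024591))
    target <- 0.85

    v    <- vec
    prev <- v                      # state before the most recent removal
    while (var(v) > target) {
      i <- sample(length(v), 1)
      if (var(v[-i]) < var(v)) {
        prev <- v
        v    <- v[-i]
      }
    }

    # Back up one step and evaluate every possible single removal from `prev`
    cand <- sapply(seq_along(prev), function(j) var(prev[-j]))
    best <- which.min(abs(cand - target))
    v2   <- prev[-best]

    c(random_stop = var(v), refined = var(v2))
    ```

    The refined result can land slightly above or below 0.85, since it takes the closest value on either side. Only the final removal is chosen deterministically, so the subset is still mostly the product of random draws.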

    Created on 2020-07-11 by the reprex package (v0.3.0)