Tags: python, numpy, probability

How to use weights in numpy.random.choice without replacement to get desired sample


I have an array of 400,000 zeros and 100,000 ones, and I want to take a sample of it, without replacement, that is approximately 50% zeros and 50% ones.

numpy.random.choice offers the ability to specify a probability distribution with which to select. So to calculate the weights necessary to get 50/50:

weight = desired_prob / actual_prob (normalized so that the weights sum to 1). Since the 1s make up 20% of the array and the 0s 80%, the weight for the 1s is 0.5 / 0.2 = 2.5 and the weight for the 0s is 0.5 / 0.8 = 0.625.
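As a quick check of that arithmetic:

import numpy as np

actual_p = np.array([0.8, 0.2])    # fractions of 0s and 1s in the array
desired_p = np.array([0.5, 0.5])
print(desired_p / actual_p)        # [0.625 2.5], the weights for 0s and 1s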

I expect that, given the code below, I will be able to get a sample using np.random.choice whose mean converges on .5 as I increase the sample size (up to a maximum of 200,000, at which point the sample would use all of the 1s). This is true if replace=True.

But if I don't use replacement, then what I find is that as I increase the sample size, the mean first moves toward .5 and then falls farther and farther below it. My only explanation for this is that numpy is internally sampling sequentially and, as the 1s get drawn down, it doesn't adjust the weights so that drawing a 1 stays as likely as it was on the first draw, but I'm not sure.
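To test that guess, here is a toy simulation of what I assume numpy does internally: draw one element at a time, tracking only the remaining counts, with the per-item weights held fixed. It shows the same downward drift in the mean:

import numpy as np

rng = np.random.default_rng(0)

def sequential_mean(sample_size, n_ones=100_000, n_zeros=400_000,
                    w_one=2.5, w_zero=0.625):
    # per-item weights never change, so each drawn 1 removes 4x more
    # weight (2.5) from the remaining pool than each drawn 0 (0.625)
    ones_drawn = 0
    for _ in range(sample_size):
        p_one = n_ones * w_one / (n_ones * w_one + n_zeros * w_zero)
        if rng.random() < p_one:
            n_ones -= 1
            ones_drawn += 1
        else:
            n_zeros -= 1
    return ones_drawn / sample_size

for size in (1_000, 50_000, 200_000):
    print(size, sequential_mean(size))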

Why is this happening and how can I weight it so that my sample has the desired 50/50 ratio without replacement?

Here's the code demonstrating this:

import numpy as np
import matplotlib.pyplot as plt

array = np.r_[np.ones(100_000), np.zeros(400_000)]
# build per-element weights: 2.5 for each 1, 0.625 for each 0
weights = array.copy()
weights[weights==1] = 2.5
weights[weights==0] = 0.625
normalized_weights = weights / weights.sum()
sample_sizes = (1_000, 5_000, 10_000, 50_000, 100_000, 200_000)
means = []
for sample_size in sample_sizes:
    means.append(np.mean(np.random.choice(array, sample_size, False, p=normalized_weights)))
plt.plot(sample_sizes, means, marker="x")
plt.ylabel("Mean")
plt.xlabel("Sample size")

[Plot of mean vs. sample size: the mean first approaches 0.5, then falls steadily as the sample size grows]


Solution

  • Numpy should raise a ValueError if your sample size is larger than your population size when replace=False. The fact that it doesn't throw an error must be a bug.

    Edit: I misread the number; now I see the issue. Yes: if you sample without replacement with the probability weights you set, you will initially remove 1s and 0s at the same rate, but eventually you will start running out of 1s to sample (remember: replace=False), even though the weight of a 1 was initially set to be four times larger than the weight of a 0. So you are right that the weights you passed to np.random.choice do not change as it samples; adjusting them would require numpy to re-normalize your weights on every draw, which would be inefficient, and there isn't really a need to compute a sample this way. But to answer your question, if you really wanted to do this, you could do something like this:

    import numpy as np
    import matplotlib.pyplot as plt
    
    array = np.r_[np.ones(100_000), np.zeros(400_000)]
    weights = array.copy()
    weights[weights==1] = 2.5
    weights[weights==0] = 0.625
    normalized_weights = weights / weights.sum()
    sample_sizes = (1_000, 5_000, 10_000, 50_000, 100_000, 200_000)
    means = []
    for sample_size in sample_sizes:
        # draw one element at a time (very slow for large sample sizes)
        means.append(np.mean([np.random.choice(array, 1, False, p=normalized_weights)
                              for _ in range(sample_size)]))
    plt.plot(sample_sizes, means, marker="x")
    plt.ylabel("Mean")
    plt.xlabel("Sample size")
    

    That is, you really do have to sample the elements one at a time to achieve the desired effect, unfortunately.
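    As an aside: each size-1 call in that loop sees the full array with the same fixed weights, so duplicates across draws are possible, and the loop is statistically the same as sampling with replacement. If that is acceptable for your use case, one vectorized call gives the same behavior far faster:

    import numpy as np

    array = np.r_[np.ones(100_000), np.zeros(400_000)]
    weights = np.where(array == 1, 2.5, 0.625)
    normalized_weights = weights / weights.sum()

    # every draw keeps P(1) = 0.5, regardless of sample size
    sample = np.random.choice(array, 200_000, replace=True, p=normalized_weights)
    print(sample.mean())   # ~0.5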