Search code examples
pythonpython-3.xnumpy

Sampling without replacement with unequal weights


I am trying to find an algorithm that selects 12 elements from a list of 25 according to the probability of each one of them. These elements cannot be repeated.

I've tried using the function without replacement, but I can't get it to keep the weights when I do 1000 iterations to check that it was done correctly. I have only succeeded if I use the function with replace = True.

Since that function returns duplicate items to me, then I clean the list removing the duplicates. By doing this, the probability that I get each number is not the one that I had defined at the beginning. I am aware that the error is here and that I should be using the function with replace = False, but I can't get it to give me the result.

import numpy as np
from collections import Counter 

nameList = ['AAAAAA', 'B', 'C', 'D','E','F','G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P','Q','R','S','T','U','V','W','X','ZZZZZZZZZZZZZZZZ']
probability_nameList = [0.10, 0.01, 0.02, 0.03, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04]

sampleNames =[]
sampleNamesControl =[]
cleaned_result = []
test_list = []

for i in range(1000):
    sampleNames += np.random.choice(nameList, 100,replace=True, p=probability_nameList)
    for item in sampleNames:    
        if not item in cleaned_result:
            cleaned_result += [item] 
    sampleNames =[]
    test_list += cleaned_result[0:12]
    cleaned_result = []

print(Counter(test_list))

Response:

Counter({'AAAAAA': 832, 'F': 522, 'X': 513, 'ZZZZZZZZZZZZZZZZ': 506, 'I': 505, 'L': 504, 'N': 501, 'S': 499, 'T': 498, 'U': 496, 'R': 492, 'O': 491, 'J': 489, 'P': 488, 'E': 487, 'V': 485, 'K': 482, 'Q': 479, 'H': 473, 'G': 471, 'M': 468, 'W': 461, 'D': 404, 'C': 297, 'B': 157})

As you can see, AAAAA had to be 10 times bigger than B, and that is not happening. P(AAAAA)= 10xP(B)


Solution

  • This is not a programming issue but fundamental maths: what you want is not possible.

    You need to take into account conditional probabilities.

    I'll give you a small example. Let's take 4 letters A,B,C,D and pick 2 of them, without replacement, with A having a 97% probability of being picked and the others 1%.

    Given A's high probability, it will be picked most of the time. But once we have picked it, as we are without replacement, there is no more A and a 1/3 chance to pick one of the other 3 values.

    Indeed:

    from itertools import chain
    from collections import Counter
    
    Counter(chain.from_iterable(np.random.choice(list('ABCD'),
                                                 2, replace=False,
                                                 p=[.97, .01, .01, .01])
                                for i in range(10000)))
    

    Output:

    Counter({'A': 9995, 'C': 3358, 'B': 3317, 'D': 3330})
    

    A is picked almost all the time and the others are picked about one third of the time.

    NB. The end frequency of B,C,D is roughly 1/3 because the probability of A is close to 1, if p(A) was smaller, the calculation of the final frequencies would be a bit more complex. The final frequency of B (or C or D) is 0.97*1/3+0.01*(0.01/0.99)*2+0.01 so a bit above 1/3, and that of A is 0.97+0.01*(0.97/0.99)*3 so ~0.9994