please I need more clarity on this, I really do not understand it well. Using this as an example
import random
my_list = [9999, 45, 63, 19, 89, 5, 72]
cum_w = [1, 9, 10, 9, 2, 12, 7]
d_rand = random.choices(my_list, cum_weights=cum_w, k=7)
sum = 0
for idx, i in enumerate(cum_w):
if idx == 0:
for i in cum_w: sum += i
print(f"cum_weight for {my_list[idx]}\t= {i/sum}\tRandom={random.choices(my_list, cum_weights=cum_w, k=7)}")
Below is the output
cum_weight for 9999 = 0.14 Random=[45, 45, 9999, 45, 45, 9999, 45]
cum_weight for 45 = 0.18 Random=[45, 45, 45, 45, 9999, 45, 45]
cum_weight for 63 = 0.2 Random=[45, 45, 45, 9999, 9999, 9999, 45]
cum_weight for 19 = 0.18 Random=[45, 45, 45, 45, 45, 45, 9999]
cum_weight for 89 = 0.04 Random=[9999, 45, 45, 45, 45, 9999, 45]
cum_weight for 5 = 0.24 Random=[45, 45, 45, 45, 45, 45, 45]
cum_weight for 72 = 0.14 Random=[45, 45, 9999, 45, 45, 45, 45]
The probability of 9(cum_w[1] and cum_w[3]) are 0.18. Why does 45(9) occur so often?
I've read random.choices documentation and does not really get to me.
How does the cum_weights works?
Please, I kindly need depth knowledge on this.
You asked "Why does 45(9) occur so often?" and "How do the cum_weights work?" Addressing the second question will explain the first. Note that what follows is an implementation of one approach used for this kind of problem. I'm not claiming that this is python's implementation, it is intended to illustrate the concepts involved.
Let's start by looking at how values can be generated if you use cumulative weights, i.e., a list where at each index the entry is the sum of all weights up to and including the current index.
import random
# Given cumulative weights, convert them to proportions, then generate U ~ Uniform(0,1)
# random values to use in a linear search to generate values in the correct proportions.
# This is based on the well-known probability result that P{a<=U<=b} = (b - a) for
# 0 <= a < b <= 1.
def gen_cumulative_weighted(values, c_weights): # values and c_weights must be lists of the same length
# Convert cumulative weights to probabilities/proportions by dividing by the last value.
# This yields a list of non-decreasing values between 0 and 1. Note that the last entry
# is always 1, so a Uniform(0, 1) random number will *always* be less than or equal to
# some entry in the list.
p = [c_weights[i] / c_weights[-1] for i in range(len(c_weights))]
while True:
index = 0 # starting from the beginning of the list
# The following three lines find the first index having the property u <= p[index].
u = random.random()
while u > p[index]:
index += 1
yield(values[index]) # yield the corresponding value.
As the comments point out, the weights are scaled by the last (and largest) value to scale them to a set of values in the range (0,1). These can be thought of as the right-most endpoints of non-overlapping subranges, each of which has a length equal to the corresponding scaled weight. (Sketch it out on paper if this is unclear, you should see it pretty quickly.) A generated Uniform(0,1) value will fall in one of those subranges, and the probability it does so is equal to the length of the subrange according to a well-known result from probability.
If we have the raw weights rather than the cumulative weights, all we have to do is convert them to cumulative and then pass the work off to the cumulative weighted version of the generator:
def gen_weighted(values, weights): # values and weights must be lists of the same length
cumulative_w = [sum(weights[:i+1]) for i in range(len(weights))]
return gen_cumulative_weighted(values, cumulative_w)
Now we're ready to use the generators:
my_values = [9999, 45, 63, 19, 89, 5, 72]
my_weights = [1, 9, 10, 9, 2, 12, 7]
good_gen = gen_weighted(my_values, my_weights)
print('Passing raw weights to the weighted implementation:')
print([next(good_gen) for _ in range(20)])
which will produce results such as:
Passing raw weights to the weighted implementation:
[63, 5, 63, 63, 72, 19, 63, 5, 45, 63, 72, 19, 5, 89, 72, 63, 63, 19, 89, 45]
Okay, so what happens if we pass raw weights to the cumulative weighted version of the algorithm? Your raw weights of [1, 9, 10, 9, 2, 12, 7]
get scaled by dividing by the last value, and become [1/7, 9/7, 10/7, 9/7, 2/7, 12/7, 1]
. When we generate u
~ Uniform(0, 1) and use it to search linearly through the scaled weights, it will yield index zero => 9999 with probability 1/7, and index one => 45 with probability 6/7! This happens because u
is always ≤ 1, and therefore always less than 9/7. As a result, the linear search will never get past any scaled weight ≥ 1, which for your inputs means it can only generate the first two values and does so with the wrong weighting.
print('Passing raw weights to the cumulative weighted implementation:')
bad_gen = gen_cumulative_weighted(my_values, my_weights)
print([next(bad_gen) for _ in range(20)])
produces results such as:
Passing raw weights to the cumulative weighted implementation:
[45, 45, 45, 45, 45, 45, 45, 9999, 45, 9999, 45, 45, 45, 45, 45, 9999, 45, 9999, 45, 45]