When I run this code, I expect both functions to return roughly 0.6. always_sunny returns approximately 0.6 as expected, but guess_with_weight returns approximately 0.52 instead.
import random

k = 1000000

def sample():
    return random.choices(population=['rain', 'sunny'], weights=[0.4, 0.6], k=k)

def always_sunny(xs):
    return sum(
        [
            (1 if x == 'sunny' else 0)
            for x in xs
        ]
    )

def guess_with_weight(xs):
    return sum(
        [
            (1 if x == y else 0)
            for x, y in zip(xs, sample())
        ]
    )

xs = sample()
print((
    always_sunny(xs) / k * 1.0,
    guess_with_weight(xs) / k * 1.0
))
Why is that? If there is a 60% chance of it being sunny, and I always guess 'sunny', I should be right 60% of the time, right? (And the code agrees.) If I instead guess 'rain' 40% of the time and 'sunny' 60% of the time, shouldn't I also be right 60% of the time? Why does the code suggest I'm only right about 52% of the time?
60% of the time, the right answer is "sunny". In 60% of that 60% (36%), you guess "sunny", and you're right, and in 40% of that 60% (24%), you guess "rain", and you're wrong.
40% of the time, the right answer is "rain". In 60% of 40% (24%) you're wrong, and in 40% of 40% (16%) you're right.
Adding the correct guesses from those two scenarios together, we get 36% + 16% = 52% right (and 24% + 24% = 48% wrong).
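The same arithmetic can be checked exactly by enumerating the four (actual, guess) combinations and adding up the ones that match. This is only a sketch: `p_weather`, `match_probability`, `weighted` and `fixed_sunny` are names made up for illustration, and it assumes the guess is drawn independently of the actual weather, as in the question's code.

```python
from fractions import Fraction
from itertools import product

# Distribution from the question, as exact fractions (0.4 rain, 0.6 sunny).
p_weather = {'rain': Fraction(2, 5), 'sunny': Fraction(3, 5)}

def match_probability(actual_dist, guess_dist):
    """Probability that an independently drawn guess equals the actual outcome."""
    return sum(
        actual_dist[a] * guess_dist[g]
        for a, g in product(actual_dist, guess_dist)
        if a == g
    )

# Guessing with the same 40/60 weights as the weather itself.
weighted = match_probability(p_weather, p_weather)

# Always guessing 'sunny'.
fixed_sunny = match_probability(p_weather, {'rain': Fraction(0), 'sunny': Fraction(1)})

print(float(weighted), float(fixed_sunny))  # 0.52 0.6
```

Using `Fraction` keeps the result exact, so the 52% is clearly not a sampling artifact of the simulation.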
Because "sunny" is the most likely answer, and because each element is "sunny" or "rain" independently of all the others, "sunny" is always the best guess. Guessing "sunny" 100% of the time gives you a 100% success rate for the 60% of the time that "sunny" is the answer, for an overall success rate of 60%. Any strategy that doesn't involve guessing "sunny" 100% of the time is therefore going to have a lower success rate than 60%.
Another way to look at it: guessing "sunny" all the time (the best strategy) gets you 60%, and guessing "rain" all the time (the worst strategy) gets you 40%. Any strategy that involves a mix of guesses is going to land you somewhere in between, proportionate to whatever ratio of best vs worst guesses you make. 52% is exactly 60% of the way from 40% to 60%.
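To make the "somewhere in between" point concrete, here is a small sketch that computes the expected success rate for any mix of guesses. The function name `success_rate` and its parameters are hypothetical, and it again assumes guesses are independent of the actual weather.

```python
def success_rate(q_sunny, p_sunny=0.6):
    """Expected fraction of correct guesses when 'sunny' is guessed with
    probability q_sunny and the weather is sunny with probability p_sunny."""
    # Right when both are sunny, or when both are rain.
    return q_sunny * p_sunny + (1 - q_sunny) * (1 - p_sunny)

for q in (0.0, 0.25, 0.5, 0.6, 1.0):
    print(f"guess sunny {q:.0%} of the time -> right {success_rate(q):.0%} of the time")
```

With `p_sunny = 0.6` this simplifies to `0.4 + 0.2 * q_sunny`, a straight line from 40% (always "rain") to 60% (always "sunny"), which is why guessing "sunny" 60% of the time lands at 52%.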