Tags: python, numpy, matplotlib, probability-theory

Plot Probability Curve with Summation


I have the following problem:

I'm working on a formula to calculate some network effects. The idea is that I have 450 "red users" and 6550 "blue users" which sums up to 7000 users in total. Now I would like to plot "picking x users (the same user cannot be picked twice, so this is sampling without replacement) and calculate the probability that at least 1 user is red".

E.g for x = 3, that means I'm picking 3 random users out of 7000 and check if any of these are "red users"

The probability of having at least 1 red user is p = 1 - (the probability that all 3 picks are blue users), and the probability that the first pick is a blue user is 6550/7000, right?

Resulting in a probability of at least 1 red user: p = 1 - 6550/7000 * 6549/6999 * 6548/6998
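As a quick sanity check, the product above can be evaluated directly for any number of picks. This is only a minimal sketch in plain Python; the function name `p_at_least_one_red` and its default arguments are made up for illustration and simply encode the 450 red / 7000 total users from the question.

```python
def p_at_least_one_red(x, red=450, total=7000):
    """Probability that x picks without replacement contain at least one red user."""
    blue = total - red
    p_all_blue = 1.0
    for i in range(x):
        # Before pick number i + 1, (blue - i) blue users remain among (total - i) users.
        p_all_blue *= (blue - i) / (total - i)
    return 1.0 - p_all_blue


# x = 3 corresponds to 1 - 6550/7000 * 6549/6999 * 6548/6998
print(p_at_least_one_red(3))
```

For x = 3 this comes out at roughly 0.18, so about an 18% chance of catching at least one red user.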

Therefore I came up with the formula:

f(x) = e^-(1 - sum of (6550-i)/(7000-i)), for i = 0 to x

What I've realized is that the curve is pretty jagged, since it just jumps from one value in ℕ to the next value in ℕ. Although decimal numbers wouldn't make much sense here, since "picking 0.5 users or even 0.01 users" is just stupid, I would like to see the full graph in order to be able to compare the formula to some others.

Is there any way I can implement this in Python?

Best regards,

Korbi


Solution

  • What you are looking for has been extensively studied before, and is known as the hypergeometric distribution in probability theory and statistics. There is thus no need to re-invent the wheel!

    We are looking for the probability of at least one red user in a sample of varying size x. This equals 1 - Pr(0 red users | sample size = x), that is, one minus the probability of the complementary event (drawing no red user at all).

    Let us illustrate this by considering sample sizes in [1, # red users]. Some Python code to help you along:

    from scipy.stats import hypergeom
    import matplotlib.pyplot as plt
    
    red = 450
    total = 7000
    
    sample_sizes = list(range(1, red + 1))
    
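    # hypergeom takes (population size, number of red users, sample size);
    # pmf(0) is the probability that the sample contains no red user at all.
    probabilities = [1 - hypergeom(total, red, sample_size).pmf(0)
                     for sample_size in sample_sizes]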
    
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(sample_sizes, probabilities, 'bo')
    
    ax.set_xlabel('Users drawn (#)')
    ax.set_ylabel('Probability of at least one red user')
    plt.show()
    

    Which yields the following graph:

    [Figure: probability of at least one red user against sample size.]

    Clearly, the probability of drawing at least one red user increases rapidly as we increase the size of the sample - nothing we did not expect, given our knowledge of the hypergeometric distribution!
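The question also asks for a curve that is defined between the integer sample sizes. One possible way to get that, offered here only as a sketch and not as part of the original answer, is to write Pr(0 red users | sample size = x) as the ratio of binomial coefficients C(6550, x) / C(7000, x) and to extend the factorials to real x with the log-gamma function. The function name `p_at_least_one_red_smooth` is hypothetical. Values at non-integer x have no probabilistic meaning; they only interpolate smoothly between the integer points.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import hypergeom

red, total = 450, 7000
blue = total - red


def p_at_least_one_red_smooth(x):
    """1 - C(blue, x) / C(total, x), extended to real x via log-gamma."""
    x = np.asarray(x, dtype=float)
    # log of C(blue, x) / C(total, x); the gammaln(x + 1) terms cancel.
    log_p_all_blue = (gammaln(blue + 1) - gammaln(blue - x + 1)
                      - gammaln(total + 1) + gammaln(total - x + 1))
    return 1.0 - np.exp(log_p_all_blue)


# At integer sample sizes the smooth curve should agree with the
# hypergeometric result; hypergeom.pmf(k, M, n, N) broadcasts over N.
ks = np.arange(1, red + 1)
exact = 1 - hypergeom.pmf(0, total, red, ks)
max_diff = np.max(np.abs(p_at_least_one_red_smooth(ks) - exact))
print(max_diff)
```

Passing something like `np.linspace(0, red, 2000)` to `p_at_least_one_red_smooth` and drawing the result with `ax.plot` as a line, instead of the `'bo'` markers, gives a continuous curve that passes through the discrete points.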