Tags: python, pandas, numpy, statistics, probability

Calculate probability 2 random people are in the same group?


In my dataset, there are N people, each assigned to one of 3 groups (groups = {A, B, C}). I want to find the probability that two random people, n_1 and n_2, belong to the same group.

I have data on each of these groups and how many people belong to them. Importantly, each group is of a different size.

import pandas as pd
import numpy as np
import math 

data = {
    "Group": ['A', 'B', 'C'],
    "Count": [20, 10, 5],
}

df = pd.DataFrame(data)
  Group  Count
0     A     20
1     B     10
2     C      5

I think I know how to get the sample space, S, but I am unsure how to get the numerator.

def nCk(n, k):
  # Number of ways to choose k items from n; integer division keeps the count exact
  f = math.factorial
  return f(n) // f(k) // f(n - k)

n = sum(df['Count'])
k = 2
s = nCk(n, k)

Solution

  • My discrete mathematics skills are a bit rusty, so feel free to correct me. You have N people split into m groups of sizes s_1, ..., s_m so that N = s_1 + ... + s_m.

    1. The chance of one random person belonging to group i is s_i / N
    2. The chance of a second person being in group i, given that the first person is already in it, is (s_i - 1) / (N - 1)
    3. The chance of both being in group i is s_i / N * (s_i - 1) / (N - 1)
    4. The probability of them being together in any group is the sum of the probabilities in #3 across all groups, because the groups are mutually exclusive.
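As a quick check of steps 3 and 4 on the counts from the question (20, 10, 5), here is a minimal sketch using exact fractions. The names `counts`, `per_group`, and `total` are illustrative and not part of the original post.

```python
from fractions import Fraction

# Group sizes from the question: A=20, B=10, C=5
counts = {"A": 20, "B": 10, "C": 5}
N = sum(counts.values())  # 35 people in total

# Step 3: probability that both people land in group i
per_group = {
    g: Fraction(s, N) * Fraction(s - 1, N - 1)
    for g, s in counts.items()
}

# Step 4: the groups are mutually exclusive, so the probabilities add
total = sum(per_group.values())

print(per_group)
print(total, float(total))
```

With these counts the total is 7/17, which is about 0.41.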

    Code:

    import numpy as np
    
    s = df['Count'].values
    n = s.sum()
    prob = np.sum(s/n * (s-1)/(n-1)) # 0.4117647058823529
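The same value can be reached by counting, which ties back to the sample space S from the question: the numerator is the number of unordered pairs that fall inside a single group. This is a sketch that assumes Python 3.8+ for `math.comb`; the variable names are illustrative.

```python
import math

counts = [20, 10, 5]          # group sizes from the question
n = sum(counts)               # 35 people in total

# Sample space: all unordered pairs of people
sample_space = math.comb(n, 2)

# Numerator: pairs in which both people come from the same group
same_group_pairs = sum(math.comb(s, 2) for s in counts)

prob_counting = same_group_pairs / sample_space
print(same_group_pairs, sample_space, prob_counting)
```

Here there are 245 same-group pairs out of 595 possible pairs, which agrees with the product formula.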
    

    We can generalize this solution to "the probability of k people all being in the same group":

    k = 2
    i = np.arange(k)[:, None]
    tmp = (s-i) / (n-i)
    prob = np.prod(tmp, axis=0).sum()
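One way to gain confidence in the generalized expression is a small Monte Carlo simulation. The sketch below assumes k = 3, a fixed seed, and 20,000 trials, all of which are arbitrary choices; it draws k people without replacement and counts how often they share a group.

```python
import numpy as np

counts = np.array([20, 10, 5])   # group sizes from the question
k = 3                            # number of people drawn (arbitrary)
n = counts.sum()

# Closed-form value, same expression as in the answer
i = np.arange(k)[:, None]
exact = np.prod((counts - i) / (n - i), axis=0).sum()

# One label per person: 0 for group A, 1 for B, 2 for C
labels = np.repeat(np.arange(len(counts)), counts)

rng = np.random.default_rng(0)   # fixed seed for reproducibility
trials = 20_000
hits = 0
for _ in range(trials):
    draw = rng.choice(labels, size=k, replace=False)
    hits += int(np.all(draw == draw[0]))   # 1 if all k share a group

estimate = hits / trials
print(exact, estimate)
```

The estimate should land close to the closed-form value, with the gap shrinking as `trials` grows.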
    

    When k > s.max() (20 in this case), the answer is 0 because k people cannot all fit in one group. When k > s.sum() (35 in this case), the result is nan, because n - i reaches 0 and the expression divides by zero.
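If the nan case is a concern, one option is a small wrapper that validates k and uses exact integer counts. This is only a sketch: `prob_same_group` is a hypothetical helper name, and it relies on `math.comb` (Python 3.8+) returning 0 when k exceeds the group size.

```python
import math

def prob_same_group(counts, k):
    """Probability that k people drawn without replacement share a group."""
    n = sum(counts)
    if not 1 <= k <= n:
        # Drawing more people than exist is undefined, so fail loudly
        raise ValueError(f"k must be between 1 and {n}, got {k}")
    # math.comb(s, k) is 0 when k > s, so groups that are too small add nothing
    return sum(math.comb(s, k) for s in counts) / math.comb(n, k)

print(prob_same_group([20, 10, 5], 2))
print(prob_same_group([20, 10, 5], 21))
```

For k = 2 this matches the value above, for k = 21 it returns 0.0, and for k = 36 it raises a `ValueError` instead of returning nan.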