In my dataset, there are N people, each assigned to one of 3 groups (groups = {A, B, C}). I want to find the probability that two randomly chosen people, n_1 and n_2, belong to the same group.
I have data on each of these groups and how many people belong to them. Importantly, each group is of a different size.
import pandas as pd
import numpy as np
import math
data = {
    "Group": ['A', 'B', 'C'],
    "Count": [20, 10, 5],
}
df = pd.DataFrame(data)
  Group  Count
0     A     20
1     B     10
2     C      5
I think I know how to get the size of the sample space, S, but I am unsure how to get the numerator.
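def nCk(n, k):
    # Number of ways to choose k items out of n (binomial coefficient).
    f = math.factorial
    # Integer division keeps the result an exact int instead of a float.
    return f(n) // (f(k) * f(n - k))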
n = sum(df['Count'])
k = 2
s = nCk(n, k)
My discrete mathematics skills are a bit rusty, so feel free to correct me. You have N people split into groups of sizes s_1, ..., s_n so that N = s_1 + ... + s_n.
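The probability that the first person is in group i is s_i / N
Given that, the probability that the second person is also in group i is (s_i - 1) / (N - 1)
The probability that both people are in group i is s_i / N * (s_i - 1) / (N - 1)
Summing the last expression over all groups gives the probability that the two people share a group.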
Code:
import numpy as np
s = df['Count'].values
n = s.sum()
prob = np.sum(s/n * (s-1)/(n-1)) # 0.4117647058823529
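For comparison with the counting approach in the question, here is a minimal sketch that builds the numerator directly: the number of same-group pairs is the sum of C(s_i, 2) over the groups, and the denominator is C(N, 2). The variable names (`counts`, `same_group_pairs`, `all_pairs`) are made up for illustration, and `math.comb` assumes Python 3.8 or newer. Since s_i * (s_i - 1) / (N * (N - 1)) equals C(s_i, 2) / C(N, 2), this should agree with the result above.

```python
from fractions import Fraction
from math import comb

counts = [20, 10, 5]  # group sizes from the question's DataFrame
total = sum(counts)   # N = 35

# Numerator: unordered pairs taken from within a single group, summed over groups.
same_group_pairs = sum(comb(c, 2) for c in counts)  # 190 + 45 + 10

# Denominator: all unordered pairs of people (the sample space S in the question).
all_pairs = comb(total, 2)

# Fraction keeps the ratio exact, so it can be compared without rounding.
prob_exact = Fraction(same_group_pairs, all_pairs)
print(prob_exact)  # 7/17
```

Using `Fraction` is only a convenience for checking the arithmetic; a plain division gives the same float as the NumPy version.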
We can generalize this solution to "the probability of k people all being in the same group":
k = 2
i = np.arange(k)[:, None]
tmp = (s-i) / (n-i)
prob = np.prod(tmp, axis=0).sum()
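One way to sanity-check the generalized formula is to enumerate every k-subset for a small k and count the ones that fall inside a single group. This is only a brute-force sketch for the example data (k = 3 is an arbitrary choice, and names such as `labels` and `brute` are illustrative); it becomes impractical quickly as N or k grows.

```python
from itertools import combinations
from math import isclose

import numpy as np

s = np.array([20, 10, 5])  # group sizes from the example
n = s.sum()
k = 3                      # arbitrary small k for the check

# Vectorised formula: row i holds (s - i) / (n - i) for each group.
i = np.arange(k)[:, None]
formula = np.prod((s - i) / (n - i), axis=0).sum()

# Brute force: one group label per person, then enumerate all k-subsets.
labels = np.repeat(np.arange(len(s)), s).tolist()
subsets = list(combinations(labels, k))
same = sum(1 for subset in subsets if len(set(subset)) == 1)
brute = same / len(subsets)

# The two results should agree up to floating-point rounding.
assert isclose(formula, brute)
```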
When k > s.max() (20 in this case), the answer is 0 because you cannot fit all of them in one group. When k > s.sum() (35 in this case), the result is nan.
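If the nan case is a concern, one possible alternative is to count with `math.comb` and reject impossible values of k explicitly. The function below is a hypothetical helper, not part of the answer's NumPy code, and it assumes Python 3.8+ for `math.comb`, which returns 0 when k exceeds the group size.

```python
from math import comb


def same_group_probability(counts, k):
    """Probability that k people drawn without replacement share one group.

    Hypothetical helper: `counts` is a sequence of group sizes.
    """
    total = sum(counts)
    if not 1 <= k <= total:
        # Drawing more people than exist (or fewer than one) is undefined.
        raise ValueError(f"k must be between 1 and {total}, got {k}")
    # comb(c, k) is 0 whenever k > c, so groups that are too small drop out.
    return sum(comb(c, k) for c in counts) / comb(total, k)


print(same_group_probability([20, 10, 5], 21))  # 0.0
```

With this version, k = 21 gives 0.0 for the example data, and k = 36 raises a `ValueError` instead of silently returning nan.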