I have 12 unique groups that I am trying to randomly sample from, each with a different number of observations. I want to randomly sample from the entire population (dataframe) with each group having the same probability of being selected from. The simplest example of this would be a dataframe with 2 groups.
  groups  probability
0      a         0.25
1      a         0.25
2      b         0.50
Using np.random.choice(df['groups'], p=df['probability'], size=100), each draw now has a 50% chance of selecting group a and a 50% chance of selecting group b.
To come up with the probabilities I used the formula:
(1. / num_groups) / size_of_groups
or in Python:
num_groups = len(df['groups'].unique()) # 2
size_of_groups = df.groupby('groups').size() # {a: 2, b: 1}
(1. / num_groups) / size_of_groups
Which returns
groups
a 0.25
b 0.50
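As a quick sanity check on the formula above, here is a minimal sketch that uses NumPy only (np.unique stands in for the pandas groupby, and the variable names are illustrative, not from the original post). It assumes the toy data with groups a, a, b and checks that the per-row weights sum to 1 and that each group ends up with the same total mass.

```python
import numpy as np

# Toy data from the example: two rows in group "a", one row in group "b".
groups = np.array(["a", "a", "b"])

# labels: unique group names; inverse: group index of each row;
# counts: number of rows per group.
labels, inverse, counts = np.unique(groups, return_inverse=True, return_counts=True)

# Same formula as above: (1 / num_groups) / size_of_group, looked up per row.
row_probs = (1.0 / len(labels)) / counts[inverse]

# Total probability mass carried by each group (should be equal across groups).
group_mass = np.bincount(inverse, weights=row_probs)

print(row_probs)   # [0.25 0.25 0.5 ]
print(group_mass)  # [0.5 0.5]
```

Because every group's mass is 1 / num_groups regardless of its size, the formula itself does not depend on how many groups there are.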
This works great until I get past 10 unique groups, after which I start getting weird distributions. Here is a small example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1234)
group_size = 12
groups = np.arange(group_size)
probs = np.random.uniform(size=group_size)
probs = probs / probs.sum()
g = np.random.choice(groups, size=10000, p=probs)
df = pd.DataFrame({'groups': g})
prob_map = ((1. / len(df['groups'].unique())) / df.groupby('groups').size()).to_dict()
df['probability'] = df['groups'].map(prob_map)
plt.hist(np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True))
plt.xticks(np.arange(group_size))
plt.show()
I would expect a fairly uniform distribution with a large enough sample size, but I am getting these wings when the number of groups is 11+. If I change the group_size variable to 10 or lower, I do get the desired uniform distribution.
I can't tell if the problem is with my formula for calculating the probabilities, or possibly a floating point precision problem? Anyone know a better way to accomplish this, or a fix for this example?
Thanks in advance!
You are using hist, which defaults to 10 bins:
plt.rcParams['hist.bins']
10
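To see why 10 bins produce the wings, here is a small sketch assuming 12 integer group labels, 0 through 11, each appearing exactly once. plt.hist computes its counts with np.histogram, so the same equal-width binning applies: ten bins of width 1.1 over [0, 11] force the first and last bins to hold two labels each.

```python
import numpy as np

# One observation per group label, 0..11 (illustrative data, not the sampled df).
labels = np.arange(12)

# Ten equal-width bins over [0, 11], matching the default bin count of hist.
counts, edges = np.histogram(labels, bins=10)

print(np.round(edges, 2))  # bin edges are 1.1 apart
print(counts)              # [2 1 1 1 1 1 1 1 1 2]
```

Labels 0 and 1 share the first bin and labels 10 and 11 share the last one, so a perfectly uniform sample still shows doubled bars at both ends.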
Pass group_size as the bins parameter:
plt.hist(
np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True),
bins=group_size)
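A variation on the same idea, offered as a sketch rather than as part of the fix above: with integer labels, bin edges placed at half-integers keep one label per bin and center each bar on its tick, and np.bincount gives the same counts with no binning at all. The uniform sample here is a stand-in for the weighted draw in the question (an assumption made for illustration).

```python
import numpy as np

group_size = 12

# Stand-in sample: integer labels 0..11 drawn uniformly (illustrative only).
rng = np.random.default_rng(1234)
samples = rng.integers(0, group_size, size=10000)

# Edges at -0.5, 0.5, ..., 11.5 give exactly one integer label per bin.
edges = np.arange(group_size + 1) - 0.5
hist_counts, _ = np.histogram(samples, bins=edges)

# Counting the labels directly yields the same numbers.
direct_counts = np.bincount(samples, minlength=group_size)

print(np.array_equal(hist_counts, direct_counts))
```

The same edges array can be passed as bins=edges to plt.hist if the bars should line up with the xticks.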