Let's say I have data like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 11, 12],
                   'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})
I want to group the data by b and sample from each group. Group 0 has many more rows than group 1 (12 versus 1). So if I do:
df1=df.groupby(['b']).apply(lambda x: x.sample(frac=0.1)).reset_index(drop=True)
You will find that group 1 is not sampled at all. It might be sampled if frac were larger.
So what should I do to keep every group, even one that is very small?
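The likely cause is rounding: as far as I can tell, sampling with frac draws round(frac * group_size) rows from each group, so a one-row group at frac=0.1 gets round(0.1) = 0 rows. The sketch below checks this on the same data. It uses groupby(...).sample, which assumes pandas 1.1 or later, in place of the apply call; random_state and the variable names are only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 11, 12],
                   'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Sample 10% of each group; random_state only makes the draw repeatable.
sampled = df.groupby('b').sample(frac=0.1, random_state=0)

# Count the sampled rows per group, reporting 0 for groups that vanished.
counts = sampled['b'].value_counts().reindex([0, 1], fill_value=0)
print(counts)
# Group 0: round(0.1 * 12) = 1 row; group 1: round(0.1 * 1) = 0 rows.
```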
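Use sample(frac=1) to shuffle the rows within each group, then find the size of the smallest group, and finally use head to take that many rows from each group. Every group then contributes the same number of rows (here, one each), and the row drawn from group 0 will vary from run to run: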
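# shuffle the rows within each group
df1 = df.groupby('b').apply(lambda x: x.sample(frac=1)).reset_index(drop=True)
# size of the smallest group
ming = df1.b.value_counts().min()
# keep the first ming rows of every group
df1 = df1.groupby('b').head(ming)
df1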
Out[287]:
     a  b
0    8  0
12  12  1
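If you would rather keep the sampling roughly proportional to frac while still guaranteeing at least one row per group, one possible alternative is to compute the sample size per group yourself and clamp it to a minimum of 1. This is only a sketch: sample_at_least_one is a hypothetical helper, not a pandas function, and its parameter names are assumptions.

```python
import pandas as pd

def sample_at_least_one(df, by, frac, random_state=None):
    """Sample about frac of each group, but never fewer than one row."""
    parts = []
    for _, group in df.groupby(by):
        # Same rounding as frac-based sampling, clamped to [1, len(group)].
        n = min(len(group), max(1, round(len(group) * frac)))
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts).reset_index(drop=True)

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 11, 12],
                   'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

out = sample_at_least_one(df, 'b', frac=0.1, random_state=0)
print(out)
```

With this data both groups end up with one row. On larger data the big groups would still be sampled at about frac, and only the groups that would otherwise round down to zero are bumped up to a single row.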