How to keep all groups when using pandas groupby + sample with a fraction, even if some groups are very small?


Let's say I have data like this:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 11, 12],
                   'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

I want to use b to split the data into two groups and sample from each. You can see that group 0 has many more rows than group 1. So, if I do:

df1=df.groupby(['b']).apply(lambda x: x.sample(frac=0.1)).reset_index(drop=True)

Group 1 is never sampled: it has a single row, and a fraction of 0.1 of one row rounds to zero rows. It would only be sampled if frac were increased.
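
To make the problem concrete, here is a minimal sketch that estimates how many rows each group contributes at frac=0.1. It assumes sample picks round(frac * group size) rows, which matches the behaviour described above; the names frac, sizes and expected_rows are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 11, 12],
                   'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

frac = 0.1

# Number of rows in each group of b
sizes = df.groupby('b').size()

# Assumption: each group contributes round(frac * size) rows
expected_rows = (sizes * frac).round().astype(int)
print(expected_rows)
```

Group 0 has 12 rows and contributes one row, while group 1 has one row and contributes none, so it drops out of the result.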

So, what should I do to keep every group, even if it is very small?


Solution

  • Use sample with frac=1 to shuffle the rows within each group, then find the smallest group size, and finally use head to take that many rows from each group:

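    # Shuffle the rows within each group (frac=1 keeps every row)
    df1 = df.groupby('b').apply(lambda x: x.sample(frac=1)).reset_index(drop=True)
    # Size of the smallest group
    ming = df1.b.value_counts().min()
    # Keep the first ming rows of every group, so no group is lost
    df1 = df1.groupby('b').head(ming)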
    df1
    Out[287]: 
         a  b
    0    8  0
    12  12  1
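
The approach above gives every group the same number of rows (the size of the smallest group). If you would rather keep sampling by fraction, a possible alternative, which is not part of the answer above, is to put a floor of one row on each group's sample size. This is only a sketch: sample_at_least_one and min_rows are hypothetical names introduced here, and the groups are combined with pd.concat rather than groupby.apply.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 11, 12],
                   'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

frac = 0.1


def sample_at_least_one(group, frac, min_rows=1):
    """Sample a fraction of the group, but never fewer than min_rows rows."""
    n = max(min_rows, round(len(group) * frac))
    # Never ask for more rows than the group actually has
    n = min(n, len(group))
    return group.sample(n=n)


# Sample each group separately, then stack the pieces back together
df2 = pd.concat(
    [sample_at_least_one(g, frac) for _, g in df.groupby('b')]
).reset_index(drop=True)
print(df2)
```

With this data, each group contributes one row: group 0 because 10% of 12 rows rounds to 1, and group 1 because of the floor. Which row is drawn from group 0 varies from run to run unless you pass random_state to sample.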