python pandas dataframe pandas-groupby

Pandas groupby sample: keep groups smaller than n as they are


I have a dataset that I want to sample from after a groupby. In general this can be done with df.groupby("some_id").sample(n=100). The problem is that some groups have fewer than 100 rows. Yes, replace=True is an option, but I would rather keep the smaller groups as they are: if a group has more than 100 rows I want a sample of 100, and if it has fewer I want to leave it unchanged. I couldn't find an example of anything similar, so any ideas are appreciated. For now the only idea I have is to forget about groupby and build a list of groups, something like:

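def weird_sampling(group):
    # Sample 100 rows only when the group has at least 100; otherwise keep it whole
    if group.shape[0] > 99:
        return group.sample(100)
    return group

groups_list = []
for i in df.some_id.unique():
    # Select the rows of one group and sample from them
    groups_list.append(weird_sampling(df[df.some_id == i]))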

But this seems extremely inefficient.


Solution

  • I think the cleanest answer might be to shuffle your data and then select up to n of each group:

    # maximum number of elements in group
    n = 100
    
    # sample(frac=1) --> randomise the order
    # groupby("some_id").head(n) --> select up to n
    df.sample(frac=1).groupby("some_id").head(n)
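
To see the shuffle-then-head idea on something small, here is a minimal sketch on toy data. The helper name `sample_up_to_n`, its `random_state` argument, and the example frame are assumptions made for illustration; they are not part of the question.

```python
import pandas as pd

def sample_up_to_n(df, group_col, n, random_state=None):
    """Return at most n randomly chosen rows per group (hypothetical helper)."""
    # Shuffle all rows, then keep the first n rows encountered for each group.
    shuffled = df.sample(frac=1, random_state=random_state)
    return shuffled.groupby(group_col).head(n)

# Toy data: group "a" has 5 rows, group "b" has only 2.
df = pd.DataFrame({"some_id": ["a"] * 5 + ["b"] * 2, "value": range(7)})

result = sample_up_to_n(df, "some_id", n=3, random_state=0)

# Count how many rows each group kept.
counts = result["some_id"].value_counts().to_dict()
print(counts)
```

Group "a" is cut down to 3 rows while group "b" keeps both of its rows. The result comes back in shuffled order; calling `result.sort_index()` restores the original row order if that matters.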