I have a dataset that I want to sample from after a groupby. In general this can be done with df.groupby("some_id").sample(n=100). The problem is that some groups have fewer than 100 rows. Yes, replace=True is an option, but what if I want to keep the smaller groups as they are? That is, if a group has more than 100 rows I want a sample of 100; if it has fewer, I want to leave it as it is. I couldn't find an example of anything similar, so any ideas are appreciated.
For now, the only idea I have is to forget about groupby and build, say, a list of groups, something like:
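def weird_sampling(group):
    # Sample 100 rows only if the group has at least 100; otherwise keep it whole
    if group.shape[0] > 99:
        return group.sample(100)
    return group

groups_list = []
for i in df.some_id.unique():
    # Select the rows of one group and sample from that sub-frame
    groups_list.append(weird_sampling(df[df.some_id == i]))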
But it seems extremely inefficient.
I think the cleanest answer might be to shuffle your data and then select up to n rows from each group:
# maximum number of elements in group
n = 100
# sample(frac=1) --> randomise the order
# groupby("some_id").head(n) --> select up to n
df.sample(frac=1).groupby("some_id").head(n)
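
To check that this behaves as intended, here is a small self-contained sketch on made-up data. The helper name sample_up_to, the toy DataFrame, and the random_state argument are my own assumptions for illustration; only sample(frac=1) and groupby(...).head(n) come from the answer above.

```python
import pandas as pd


def sample_up_to(df, key, n, random_state=None):
    """Return at most n randomly chosen rows per group of `key`.

    Hypothetical helper: shuffles all rows, then keeps the first n
    rows encountered in each group.
    """
    shuffled = df.sample(frac=1, random_state=random_state)
    return shuffled.groupby(key).head(n)


# Toy data (made up): group "a" has 5 rows, "b" has 2, "c" has 1
df = pd.DataFrame({
    "some_id": ["a"] * 5 + ["b"] * 2 + ["c"] * 1,
    "value": range(8),
})

result = sample_up_to(df, "some_id", n=3, random_state=0)

# Number of rows kept per group
sizes = result.groupby("some_id").size().to_dict()
print(sizes)
```

With n=3, group "a" is cut down to 3 rows while "b" and "c" keep their 2 and 1 rows. The result comes back in shuffled order, so add .sort_index() at the end if you want the original row order back.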