I have a pandas DataFrame. Say I want to sample two persons from each group; I use the following code to get a new DataFrame:
sample_df = df.groupby("category").apply(lambda group_df: group_df.sample(2, random_state=1234))
I would like to create a dataframe where the non-sampled persons are stored.
The sample_df still has the indices of the original df, so I probably have to do something with that, but I'm not sure what...
Thanks in advance!
First, pass group_keys=False to groupby so that category is not added as an index level (which would turn the result's index into a MultiIndex):
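import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'category': list('aaabbb')
})
# group_keys=False keeps the original row labels as the index of the result
sample_df = (df.groupby("category", group_keys=False)
               .apply(lambda group_df: group_df.sample(2, random_state=1234)))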
print(sample_df)
A B category
0 a 4 a
1 b 5 a
3 d 5 b
4 e 5 b
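As an aside, on pandas 1.1 or later the per-group sampling can probably be written without apply, using DataFrameGroupBy.sample. This is a minimal sketch on the same example frame, not a drop-in replacement: the rows it picks for a given random_state may differ from the apply version, so no output is shown.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'category': list('aaabbb')
})

# Sample two rows per category directly on the groupby object
# (DataFrameGroupBy.sample, assumed available: pandas >= 1.1).
sample_df = df.groupby("category").sample(n=2, random_state=1234)

print(sample_df)
```

The result keeps the original index labels, so the filtering step below applies unchanged.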
Next, filter the original df with boolean indexing: build a mask with Index.isin and invert it with ~:
non_sample_df = df[~df.index.isin(sample_df.index)]
print(non_sample_df)
A B category
2 c 4 a
5 f 4 b
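If the index labels of df are unique, DataFrame.drop should give the same result as the inverted isin mask. The sketch below assumes that uniqueness and uses a hand-picked stand-in for sample_df (the rows 0, 1, 3, 4 from the output above) so that it does not depend on the random draw:

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'category': list('aaabbb')
})

# Stand-in for the sampled rows (assumption: any subset of df
# that keeps the original index labels behaves the same way).
sample_df = df.loc[[0, 1, 3, 4]]

# Both approaches match rows by label, so duplicate labels would
# remove more rows than were sampled.
assert df.index.is_unique

by_mask = df[~df.index.isin(sample_df.index)]  # approach from the answer
by_drop = df.drop(sample_df.index)             # alternative: drop by label

print(by_drop)
```

With a non-unique index, one option is to call df.reset_index(drop=True) before sampling so that every row has its own label.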