
Python: sample from dataframe, storing the non-sampled


I have a pandas DataFrame. To sample two persons from each group, I use the following code to get a new dataframe:

sample_df = df.groupby("category").apply(lambda group_df: group_df.sample(2, random_state=1234))

I would like to create a dataframe where the non-sampled persons are stored.

The sample_df still has the indices of the original df, so I probably have to do something with that, but I'm not sure what...

Thanks in advance!


Solution

  • First, pass group_keys=False to groupby so that category is not added to the result as an extra MultiIndex level:

    import pandas as pd

    df = pd.DataFrame({
        'A': list('abcdef'),
        'B': [4, 5, 4, 5, 5, 4],
        'category': list('aaabbb')
    })
    
    sample_df = (df.groupby("category", group_keys=False)
                   .apply(lambda group_df: group_df.sample(2, random_state=1234)))
    print(sample_df)
       A  B category
    0  a  4        a
    1  b  5        a
    3  d  5        b
    4  e  5        b
    

    The sampled rows keep their original index values, so the non-sampled rows can be selected with boolean indexing: Index.isin builds a mask of the sampled index values, and ~ inverts it:

    non_sample_df = df[~df.index.isin(sample_df.index)]
    print(non_sample_df)
       A  B category
    2  c  4        a
    5  f  4        b
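
As an alternative sketch, assuming pandas 1.1 or later (where DataFrameGroupBy.sample is available) and a unique index on df, the sampling can be done without apply, and the remaining rows can be obtained with DataFrame.drop. The sample data mirrors the example above; which rows are drawn depends on random_state and the pandas/NumPy versions, so no output is shown.

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("abcdef"),
    "B": [4, 5, 4, 5, 5, 4],
    "category": list("aaabbb"),
})

# Sample two rows per category; the result keeps the original index labels.
sample_df = df.groupby("category").sample(n=2, random_state=1234)

# Drop the sampled labels from the original frame to get the rest.
# Assumption: df has a unique index; with duplicate labels, drop would
# also remove unsampled rows that share a label with a sampled one.
non_sample_df = df.drop(sample_df.index)

print(sample_df)
print(non_sample_df)
```

If the index is not unique, calling df.reset_index(drop=True) before sampling gives each row its own label, after which either drop or the isin mask works as intended.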