python, pandas, group-by, sampling

How can I sample from a dataframe weighted by a groupby column?


Here is the sampling method I tried:

sample = 2000  # rows drawn per group
sample_df = df.groupby('prefix').sample(n=sample, random_state=1)

It groups df by prefix and, for each group, samples 2k rows. I have 9 groups. I want to sample 18k rows in total, with each group's share weighted by the number of rows in that group.


Solution

  • IIUC, here is one way:

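    sample = 18000  # total number of rows to draw across all groups
    col_name = "prefix"
    
    # Per-row weight: the number of rows that share this row's prefix
    probs = df[col_name].map(df[col_name].value_counts())
    sample_df = df.sample(n=sample, weights=probs)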
    

    probs holds an (unnormalized) weight for each row: the number of rows that share its value in the prefix column. df.sample normalizes these weights to sum to 1 and draws rows accordingly.
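
To see what these weights imply, the sketch below normalizes them so they sum to 1 and adds up the per-row probabilities within each group. The toy frame and the names `toy`, `row_prob`, `group_share`, and `size_share` are illustrative assumptions, not part of the answer. Because every row carries its group's count as its weight, a group's chance on the first draw scales with the square of its count, so larger groups are favored more strongly than their plain share of rows would suggest.

```python
import pandas as pd

# Hypothetical toy data: 4 "other", 3 "this", and 2 "view" rows.
toy = pd.DataFrame({"B": ["this", "this", "other", "view", "other",
                          "other", "this", "view", "other"]})

counts = toy["B"].value_counts()
weights = toy["B"].map(counts)  # per-row weight = size of the row's group

# Normalize so the per-row probabilities sum to 1.
row_prob = weights / weights.sum()

# Chance that the first drawn row belongs to each group.
group_share = row_prob.groupby(toy["B"]).sum()

# Plain share of rows per group, for comparison.
size_share = counts / len(toy)

print(pd.DataFrame({"weighted": group_share, "by_size": size_share}))
```

On this toy data the weighted shares work out to 16/29, 9/29, and 4/29 for other, this, and view, compared with 4/9, 3/9, and 2/9 by size alone.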


    Steps on some sample data:

    >>> df
    
           B         C         D
    0   this  0.469112 -0.861849
    1   this -0.282863 -2.104569
    2  other -1.509059 -0.494929
    3   view -1.135632  1.071804
    4  other  1.212112  0.721555
    5  other -0.173215 -0.706771
    6   this  0.119209 -1.039575
    7   view -1.044236  0.271860
    8  other  0.322124  2.010234
    
    >>> col_name = "B"
    >>> sample = 4
    
    >>> counts = df[col_name].value_counts()
    >>> counts
    
    other    4
    this     3
    view     2
    Name: B, dtype: int64
    
    >>> probs = df[col_name].map(counts)
    >>> probs
    
    0    3
    1    3
    2    4
    3    2
    4    4
    5    4
    6    3
    7    2
    8    4
    Name: B, dtype: int64
    
    # seeing side-by-side with df.B
    >>> pd.concat([df.B, probs], axis=1)
    
           B  B
    0   this  3
    1   this  3
    2  other  4
    3   view  2
    4  other  4
    5  other  4
    6   this  3
    7   view  2
    8  other  4
    

    That is, each row is assigned the count of its col_name value, and that count serves as the row's relative sampling weight.

    # sampling:
    >>> sample_df = df.sample(n=sample, weights=probs, random_state=1284)
    >>> sample_df
    
           B         C         D
    6   this  0.119209 -1.039575
    3   view -1.135632  1.071804
    2  other -1.509059 -0.494929
    5  other -0.173215 -0.706771
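
If the goal is a fixed total in which each group contributes rows in proportion to its size, one alternative is to give every group the same sampling fraction through `groupby(...).sample(frac=...)`. This is a sketch under assumptions rather than the method shown above: the synthetic frame, its group sizes, and the total of 20 rows are made up for illustration, and because each group's size is rounded separately, the overall total can differ from the target by a few rows.

```python
import pandas as pd

# Hypothetical data: three prefixes with 60, 30, and 10 rows.
df = pd.DataFrame({
    "prefix": ["a"] * 60 + ["b"] * 30 + ["c"] * 10,
    "value": range(100),
})

total = 20               # desired overall sample size (assumed for this example)
frac = total / len(df)   # the same fraction is applied to every group

# Each group contributes about frac * (group size) rows.
sample_df = df.groupby("prefix").sample(frac=frac, random_state=1)

print(sample_df["prefix"].value_counts())
```

For the case in the question, `total` would be 18000 and the grouping column would stay `prefix`.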