Tags: python, pandas, group-by, probability, sample

Group dataframe and sample n rows with equal probability between groups


I have a pandas dataframe like this:

     ID  Value
0     a     2
1     a     4
2     b     6
3     c     8
4     c    10
5     c    12

I would like to sample equally from the ID groups. I know I can group the data frame by ID and specify the number of rows to sample from each group, like this: df.groupby("ID").sample(n=2, replace=True). However, I only want each group to have the same probability of being sampled, not necessarily the exact same number of rows per group.
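
To make the example reproducible, here is a minimal sketch that builds the frame shown above and runs the fixed-size, per-group sampling. The `random_state` argument and the name `fixed` are additions for illustration only; they are not part of the original question.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "Value": [2, 4, 6, 8, 10, 12],
})

# Fixed-size sampling: every group contributes exactly n rows,
# regardless of how many rows the group has.
# random_state is only here to make the draw repeatable.
fixed = df.groupby("ID").sample(n=2, replace=True, random_state=0)

# Count how many sampled rows came from each group.
print(fixed["ID"].value_counts().sort_index())
```

Each group always ends up with exactly two rows here, which is the behaviour the question wants to relax.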


Solution

  • If you want to sample N rows with roughly the same probability of drawing from each group, you could oversample each group and then sample again from the result:

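    import math
    
    N = 4  # total number of rows to sample
    
    # Oversample: draw ceil(N / number_of_groups) rows from every group (with replacement),
    # then randomly keep N of those rows.
    out = (df.groupby('ID').sample(n=math.ceil(N/df['ID'].nunique()), replace=True)
             .sample(N)
          )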
    

    Example output with N = 4:

      ID  Value
    2  b      6
    2  b      6
    4  c     10
    1  a      4
    

    With N = 10:

      ID  Value
    0  a      2
    2  b      6
    5  c     12
    3  c      8
    1  a      4
    5  c     12
    2  b      6
    1  a      4
    1  a      4
    2  b      6
    

    Proportion of rows per group with N = 100:

    ID
    b    0.34
    a    0.33
    c    0.33
    Name: proportion, dtype: float64
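
As a different technique from the oversampling above, the same goal can likely be reached in a single draw by weighting each row with the inverse of its group size, so that every group carries the same total weight. The sketch below is an alternative under stated assumptions, not the answer's method: the names `weights` and `group_share` and the `random_state` value are illustrative, and it relies on `DataFrame.sample` normalizing weights that do not sum to 1.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "Value": [2, 4, 6, 8, 10, 12],
})

N = 100  # total number of rows to sample

# Weight each row by 1 / (number of rows in its group):
# a -> 1/2 each, b -> 1, c -> 1/3 each, so every group sums to 1.
weights = 1 / df.groupby("ID")["ID"].transform("count")

# Draw N rows with replacement; pandas normalizes the weights,
# so each draw picks group a, b, or c with probability 1/3.
out = df.sample(n=N, replace=True, weights=weights, random_state=0)

# Share of sampled rows per group (should be close to 1/3 each).
group_share = out["ID"].value_counts(normalize=True)
print(group_share)
```

Because each draw is independent, the per-group counts vary from run to run, which matches the "same probability, not the same number of rows" requirement in the question.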