Pandas - sample many groups with different proportions


I need to sample a dataframe by group, using a different proportion for each group. I have more than 100 groups, but for the sake of simplicity my example has just 3. Let's suppose I have this dataframe:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
                    'vals': np.random.randn(120)})
N = len(df2)
df2.groupby('group_id').count()

#           vals
#group_id   
#A         40
#B         60
#C         20

And I want to sample groups A, B and C using the dataframe below for proportion:

prop = pd.DataFrame({'A': {0.45},
                     'B': {0.55},
                     'C': {0.62}})

When I try to sample, I get an error:

grouped = df2.groupby('group_id')
x = grouped.apply(lambda x: x.sample(frac=props))

error: NameError: global name 'props' is not defined

Any help is highly appreciated! Thanks


Solution

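  • I think you need a DataFrame of scalars, and then look up each group's fraction by x.name. Note also that the NameError comes from the code referring to props while the variable is named prop: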

    prop = pd.DataFrame({'A': [0.45],
                         'B': [0.55],
                         'C': [0.62]})

    grouped = df2.groupby('group_id')
    # prop[x.name] is a one-element Series; .iloc[0] pulls out the scalar fraction
    x = grouped.apply(lambda x: x.sample(frac=prop[x.name].iloc[0]))
    print(x.head(20))
                group_id      vals
    group_id                      
    A        19        A  1.157552
             37        A  0.086347
             0         A -0.668129
             8         A -0.345811
             27        A -0.301085
             14        A -0.325130
             6         A  0.301966
             15        A  1.944702
             4         A  1.350509
             1         A -0.498210
             2         A  0.618576
             31        A -0.274381
             16        A  1.915676
             25        A  0.136372
             32        A  0.864837
             9         A -0.315231
             20        A -0.106208
             34        A  1.324797
    B        85        B -0.861647
             55        B -0.079275
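
With 100+ groups, a one-row DataFrame may be more structure than the lookup needs. Below is a minimal sketch of an alternative, assuming the fractions can be kept in a plain dict keyed by group label. The names fracs, parts and sampled are made up for illustration, and random_state is only there to make the draw repeatable. This is not the method from the answer above, just another way to do the same per-group lookup:

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question: 40 rows of A, 60 of B, 20 of C.
df2 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
                    'vals': np.random.randn(120)})

# Hypothetical mapping of group label -> sampling fraction (an assumption,
# standing in for the prop DataFrame).
fracs = {'A': 0.45, 'B': 0.55, 'C': 0.62}

# Sample each group on its own with its own fraction, then stitch the
# pieces back together into one DataFrame.
parts = [grp.sample(frac=fracs[name], random_state=0)
         for name, grp in df2.groupby('group_id')]
sampled = pd.concat(parts)

# Rows kept per group; each count should be close to frac * group size.
print(sampled.groupby('group_id').size())
```

Because the groups are iterated explicitly, the result keeps the original row index rather than the two-level index that apply produces. A group label missing from fracs would raise a KeyError here, which is probably what you want when every group is supposed to have a proportion.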