I need to sample a dataframe by group, using a different proportion for each group. I have more than 100 groups, but for the sake of simplicity my example has just 3. Let's suppose I have this dataframe:
import numpy as np
import pandas as pd
df2 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
                    'vals': np.random.randn(120)})
N = len(df2)
df2.groupby('group_id').count()
#           vals
# group_id
# A           40
# B           60
# C           20
And I want to sample groups A, B and C using the proportions in the dataframe below:
prop = pd.DataFrame({'A': {0.45},
'B': {0.55},
'C': {0.62}})
When I try to sample, I get an error:
grouped = df2.groupby('group_id')
x = grouped.apply(lambda x: x.sample(frac=props))
error: NameError: global name 'props' is not defined
Any help is highly appreciated! Thanks
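The NameError itself is just a typo (props instead of prop). Beyond that, I think you need a DataFrame of scalars (lists instead of sets as the values) and then look up each group's fraction by x.name: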
prop = pd.DataFrame({'A': [0.45],
'B': [0.55],
'C': [0.62]})
grouped = df2.groupby('group_id')
# prop[x.name] is a one-element Series; .loc[0, x.name] pulls out the scalar fraction
x = grouped.apply(lambda x: x.sample(frac=prop.loc[0, x.name]))
print(x.head(20))
             group_id      vals
group_id
A        19         A  1.157552
         37         A  0.086347
         0          A -0.668129
         8          A -0.345811
         27         A -0.301085
         14         A -0.325130
         6          A  0.301966
         15         A  1.944702
         4          A  1.350509
         1          A -0.498210
         2          A  0.618576
         31         A -0.274381
         16         A  1.915676
         25         A  0.136372
         32         A  0.864837
         9          A -0.315231
         20         A -0.106208
         34         A  1.324797
B        85         B -0.861647
         55         B -0.079275
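
With more than 100 groups it may be easier to keep the fractions in a plain dict (or a Series) keyed by group instead of a one-row DataFrame. Below is a minimal sketch of that variant, not the only way to do it: the names fracs and sampled are made up for illustration, and random_state=0 is only there so the draw is repeatable.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question: 40 rows of A, 60 of B, 20 of C.
df2 = pd.DataFrame({
    "group_id": np.repeat(["A", "B", "C"], (40, 60, 20)),
    "vals": np.random.randn(120),
})

# Hypothetical name: one sampling fraction per group, kept in a plain dict.
fracs = {"A": 0.45, "B": 0.55, "C": 0.62}

# Iterate over the groups and sample each one with its own fraction.
# random_state is an assumption added here to make the draw repeatable.
sampled = pd.concat(
    [
        group.sample(frac=fracs[name], random_state=0)
        for name, group in df2.groupby("group_id")
    ]
)

# Number of sampled rows per group.
counts = sampled["group_id"].value_counts().sort_index()
print(counts)
```

Since sample draws round(frac * group size) rows, this should give 18 rows for A, 33 for B and 12 for C. The result keeps the original row index rather than adding a group_id index level, so sampled.index can be used directly to select or drop those rows from df2.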