I have two dataframes, `df_a` and `df_b`:

df_a

| country | type | risk_level | sample |
|---|---|---|---|
| NO | A | HR | 3.0 |
| NO | B | LR | 2.0 |
| DK | C | HR | 1.0 |

df_b

| caseid | country | type | risk_level | user | end_user |
|---|---|---|---|---|---|
| 1 | DK | A | HR | Jess | 1 |
| 2 | NO | B | LR | James | 1 |
| 3 | DK | C | HR | Tom | 2 |
| 3 | DK | B | LR | Anna | 1 |
Rows in `df_a` are unique. I am trying to filter `df_b` using each row of `df_a` as a set of conditions, where the number of rows to pick from `df_b` equals the value in the `sample` column of `df_a`. For example, 3 cases with `country` equal to `NO`, `type` equal to `A`, and `risk_level` equal to `HR` should be picked from `df_b`.
Here is my code:

```python
sample_list = []
for i in df_a['country']:
    for j in df_a['type']:
        for k in df_a['risk_level']:
            filter_df_b = df_b[(df_b['country'] == i) & (df_b['type'] == j) & (df_b['risk_level'] == k)].sort_values(by='end_user', ascending=True)
            sample_nums = int(df_a.loc[(df_a['country'] == i) & (df_a['type'] == j) & (df_a['risk_level'] == k), 'sample'].values)
            sample_list.append(filter_df_b.head(sample_nums))
sample_list
```
The error was `only size-1 arrays can be converted to Python scalars`. I know it is related to `sample_nums = int(...)`, but I don't understand why: since `i`, `j`, and `k` each take a single value per iteration, shouldn't `sample_nums` be a single value in each loop? I tried changing it to:

```python
sample_nums = df_a.loc[(df_a['country'] == i) & (df_a['type'] == j) & (df_a['risk_level'] == k), 'sample'].values.astype('int')
```

But that didn't work either.
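For reference, the error can be reproduced outside the loops. The three nested loops iterate over the columns independently, so they also visit `(country, type, risk_level)` combinations that appear in no row of `df_a` (e.g. `('NO', 'A', 'LR')`); for those, the `.loc` selection is empty, and calling `int()` on a size-0 array raises exactly this `TypeError`. A minimal sketch, rebuilding `df_a` from the table above:

```python
import pandas as pd

df_a = pd.DataFrame({'country': ['NO', 'NO', 'DK'],
                     'type': ['A', 'B', 'C'],
                     'risk_level': ['HR', 'LR', 'HR'],
                     'sample': [3.0, 2.0, 1.0]})

# A combination produced by the nested loops but absent from df_a:
vals = df_a.loc[(df_a['country'] == 'NO') &
                (df_a['type'] == 'A') &
                (df_a['risk_level'] == 'LR'), 'sample'].values
print(vals.size)   # 0 -- empty selection
try:
    int(vals)      # int() requires exactly one element
except TypeError as e:
    print(e)       # only size-1 arrays can be converted to Python scalars
```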
You can use `merge` to keep only the rows of `df_b` that match a row of `df_a`, then `groupby` + `apply` to draw the expected number of samples from each group (casting `sample` to `int`, since `DataFrame.sample` expects an integer `n`):

```python
num_of_samples = lambda x: x.sample(n=min(len(x), int(x['sample'].iloc[0])))
cols = df_a.columns[:-1].tolist()  # ['country', 'type', 'risk_level']
out = (df_b.merge(df_a, on=cols)
           .groupby(cols, group_keys=False)
           .apply(num_of_samples))
```
Output:

```python
>>> out
   caseid country type risk_level   user  end_user  sample
0       2      NO    B         LR  James         1     2.0   # not enough samples
1       3      DK    C         HR    Tom         2     1.0
```
Note the `min(len(x), ...)` in the sampling lambda: if a group does not contain enough rows, all of its available rows are used instead.
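If, as in the original attempt, you want a deterministic selection (the first `sample` rows after sorting by `end_user`) rather than a random one, the same merge/groupby pattern works with `sort_values` + `head`. A sketch, rebuilding both frames from the tables in the question:

```python
import pandas as pd

df_a = pd.DataFrame({'country': ['NO', 'NO', 'DK'],
                     'type': ['A', 'B', 'C'],
                     'risk_level': ['HR', 'LR', 'HR'],
                     'sample': [3.0, 2.0, 1.0]})
df_b = pd.DataFrame({'caseid': [1, 2, 3, 3],
                     'country': ['DK', 'NO', 'DK', 'DK'],
                     'type': ['A', 'B', 'C', 'B'],
                     'risk_level': ['HR', 'LR', 'HR', 'LR'],
                     'user': ['Jess', 'James', 'Tom', 'Anna'],
                     'end_user': [1, 1, 2, 1]})

cols = df_a.columns[:-1].tolist()  # ['country', 'type', 'risk_level']
# Take the first `sample` rows of each group, ordered by end_user:
take_n = lambda g: g.sort_values('end_user').head(int(g['sample'].iloc[0]))
out = (df_b.merge(df_a, on=cols)
           .groupby(cols, group_keys=False)
           .apply(take_n))
print(out)  # caseids 2 and 3 -- the only rows of df_b matching a row of df_a
```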