I have two dataframes, `df_a` and `df_b`:

df_a

| country | type | risk_level | sample |
|---|---|---|---|
| NO | A | HR | 3.0 |
| NO | B | LR | 2.0 |
| DK | C | HR | 1.0 |

df_b

| caseid | country | type | risk_level | user | end_user |
|---|---|---|---|---|---|
| 1 | DK | A | HR | Jess | 1 |
| 2 | NO | B | LR | James | 1 |
| 3 | DK | C | HR | Tom | 2 |
| 3 | DK | B | LR | Anna | 1 |
Rows in `df_a` are unique. I am trying to filter `df_b` using each row of `df_a` as a set of conditions, where the number of rows to pick from `df_b` equals the value in the `sample` column of `df_a`. For example, 3 cases with `country` equal to `NO`, `type` equal to `A`, and `risk_level` equal to `HR` should be picked from `df_b`.
Here is my code:

```python
sample_list = []
for i in df_a['country']:
    for j in df_a['type']:
        for k in df_a['risk_level']:
            filter_df_b = df_b[(df_b['country'] == i) & (df_b['type'] == j) & (df_b['risk_level'] == k)].sort_values(by='end_user', ascending=True)
            sample_nums = int(df_a.loc[(df_a['country'] == i) & (df_a['type'] == j) & (df_a['risk_level'] == k), 'sample'].values)
            sample_list.append(filter_df_b.head(sample_nums))
sample_list
```
The error was `only size-1 arrays can be converted to Python scalars`. I know it is related to `sample_nums = int(...)`, but I don't understand why: since `i`, `j`, and `k` each take a single value per iteration, shouldn't `sample_nums` be a single value in each loop? I tried changing it to:

```python
sample_nums = df_a.loc[(df_a['country'] == i) & (df_a['type'] == j) & (df_a['risk_level'] == k), 'sample'].values.astype('int')
```

But that didn't work either.
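For reference, the error can be reproduced outside the loops. The three nested loops iterate over the columns independently, so they also visit `(country, type, risk_level)` combinations that appear in no row of `df_a` (e.g. `('NO', 'A', 'LR')`); for those, the `.loc` selection is empty, and calling `int()` on a size-0 array raises exactly this `TypeError`. A minimal sketch, rebuilding `df_a` from the table above:

```python
import pandas as pd

df_a = pd.DataFrame({'country': ['NO', 'NO', 'DK'],
                     'type': ['A', 'B', 'C'],
                     'risk_level': ['HR', 'LR', 'HR'],
                     'sample': [3.0, 2.0, 1.0]})

# A combination produced by the nested loops but absent from df_a:
vals = df_a.loc[(df_a['country'] == 'NO') &
                (df_a['type'] == 'A') &
                (df_a['risk_level'] == 'LR'), 'sample'].values
print(vals.size)   # 0 -- empty selection
try:
    int(vals)      # int() requires exactly one element
except TypeError as e:
    print(e)       # only size-1 arrays can be converted to Python scalars
```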
You can use `merge` to keep only the rows of `df_b` that match a row of `df_a`, then `groupby` + `apply` to draw the expected number of samples from each group (casting `sample` to `int`, since `DataFrame.sample` expects an integer `n`):

```python
num_of_samples = lambda x: x.sample(n=min(len(x), int(x['sample'].iloc[0])))
cols = df_a.columns[:-1].tolist()  # ['country', 'type', 'risk_level']
out = (df_b.merge(df_a, on=cols)
           .groupby(cols, group_keys=False)
           .apply(num_of_samples))
```
Output:

```python
>>> out
   caseid country type risk_level   user  end_user  sample
0       2      NO    B         LR  James         1     2.0   # not enough samples
1       3      DK    C         HR    Tom         2     1.0
```
Note the `min(len(x), ...)` in the sampling lambda: if a group does not contain enough rows, all of its available rows are used instead.
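If, as in the original attempt, you want a deterministic selection (the first `sample` rows after sorting by `end_user`) rather than a random one, the same merge/groupby pattern works with `sort_values` + `head`. A sketch, rebuilding both frames from the tables in the question:

```python
import pandas as pd

df_a = pd.DataFrame({'country': ['NO', 'NO', 'DK'],
                     'type': ['A', 'B', 'C'],
                     'risk_level': ['HR', 'LR', 'HR'],
                     'sample': [3.0, 2.0, 1.0]})
df_b = pd.DataFrame({'caseid': [1, 2, 3, 3],
                     'country': ['DK', 'NO', 'DK', 'DK'],
                     'type': ['A', 'B', 'C', 'B'],
                     'risk_level': ['HR', 'LR', 'HR', 'LR'],
                     'user': ['Jess', 'James', 'Tom', 'Anna'],
                     'end_user': [1, 1, 2, 1]})

cols = df_a.columns[:-1].tolist()  # ['country', 'type', 'risk_level']
# Take the first `sample` rows of each group, ordered by end_user:
take_n = lambda g: g.sort_values('end_user').head(int(g['sample'].iloc[0]))
out = (df_b.merge(df_a, on=cols)
           .groupby(cols, group_keys=False)
           .apply(take_n))
print(out)  # caseids 2 and 3 -- the only rows of df_b matching a row of df_a
```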