Search code examples
pythonfor-loopsubset

Pick up records from one dataframe with conditions based on other dataframe in Python


I have 2 dataframes, namely df_a and df_b:

df_a

country type risk_level sample
NO A HR 3.0
NO B LR 2.0
DK C HR 1.0

df_b

caseid country type risk_level user end_user
1 DK A HR Jess 1
2 NO B LR James 1
3 DK C HR Tom 2
3 DK B LR Anna 1

Rows in df_a are unique. I am trying to filter df_b based on conditions being each row of df_a, and the number of rows that should be picked up from df_b is equal to the value from column 'sample' of df_a.

For example, 3 cases with 'country' being 'NO', 'type' being 'A', and 'risk_level' being 'HR' should be picked up from df_b.

Here is my code:

sample_list = []
for i in df_a['country']:
    for j in df_a['type']:
        for k in df_a['risk_level']:
            filter_df_b = df_b[(df_b['country'] == i) & (df_b['type'] == j) & (df_b['risk_level'] == k)].sort_values(by='end_user', ascending=True)
            sample_nums = int(df_a.loc[(df_a['country'] == i) & (df_a['type'] == j) & (df_a['risk_level'] == k), 'sample'].values)
            sample_list.append(filter_df_b.head(sample_nums))

sample_list

The error was 'only size-1 arrays can be converted to Python scalars'. I know that it is related to sample_nums = int(...).

However, I don't understand why? As i, j, k are chosen one time each, the value of sample_nums should be a single value for each loop?!

I tried to change it to:

sample_nums = df_a.loc[(df_a['country'] == i) & (df_a['type'] == j) & (df_a['risk_level'] == k), 'sample'].values.astype('int')

But it didn't work either.


Solution

  • You can use merge to select only valid rows from dfb then use groupby_apply to get expected samples:

    num_of_samples = lambda x: x.sample(n=min(len(x), x['sample'].iloc[0]))
    cols = dfa.columns[:-1].tolist()
    
    out = (dfb.merge(dfa, on=cols)
              .groupby(cols, group_keys=False)
              .apply(num_of_samples))
    

    Output:

    >>> out
       caseid country type risk_level   user  end_user  sample
    0       2      NO    B         LR  James         1     2.0  # not enough sample
    1       3      DK    C         HR    Tom         2     1.0
    

    Note about min(len(x), x['sample'].iloc[0]). If you don't have enough samples, use all available rows for the current group.