Search code examples
pythonpandasnumpy

Filter a dataframe based on another dataframe


I have 2 dataframes:

df 1 =

id name
1 name1
2 name2
3 name3
4 name4

df 2 =

id total
1 10
1 24
1 33
2 14
2 21
3 30
4 1
4 29
4 31

I want to be able to remove items from df1 based on if the corresponding id in df2 'total' is greater than a certain value.

I tried using creating a boolean mask using DataFrame.apply which was very slow:

  def fn_should_drop(row, check_df):
    df_match = check_df.loc[check_df["id"] == row["id"]]

    max_for_id = df_match["id"].max()
    max_value = 25
    return max_for_id >= max_value

mask = df1.apply(fn_should_drop, check_df=df2, axis=1)
df_result = df1[mask]

Is there a way to achieve this using vectorization (eg np.where)?


Solution

  • Here's one approach:

    df_result = df1.loc[~df1['id'].isin(df2.loc[df2['total'] > max_value, 'id'])]
    

    Output:

       id   name
    1   2  name2
    

    (change > to >= if you meant that. Your Q seems ambiguous.)

    Data used

    import pandas as pd
    
    data = {'id': {0: 1, 1: 2, 2: 3, 3: 4}, 
            'name': {0: 'name1', 1: 'name2', 2: 'name3', 3: 'name4'}}
    df1 = pd.DataFrame(data)
    
    data2 = {'id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 4, 8: 4}, 
             'total': {0: 10, 1: 24, 2: 33, 3: 14, 4: 21, 5: 30, 6: 1, 7: 29, 8: 31}}
    df2 = pd.DataFrame(data2)