I have 2 dataframes:
id | name |
---|---|
1 | name1 |
2 | name2 |
3 | name3 |
4 | name4 |
id | total |
---|---|
1 | 10 |
1 | 24 |
1 | 33 |
2 | 14 |
2 | 21 |
3 | 30 |
4 | 1 |
4 | 29 |
4 | 31 |
I want to be able to remove items from df1 based on if the corresponding id in df2 'total' is greater than a certain value.
I tried using creating a boolean mask using DataFrame.apply
which was very slow:
def fn_should_drop(row, check_df):
df_match = check_df.loc[check_df["id"] == row["id"]]
max_for_id = df_match["id"].max()
max_value = 25
return max_for_id >= max_value
mask = df1.apply(fn_should_drop, check_df=df2, axis=1)
df_result = df1[mask]
Is there a way to achieve this using vectorization (eg np.where
)?
Here's one approach:
df2['id']
where df2['total'] > max_value
using df.loc
(boolean indexing).df1['id']
keep the inverse matches: ~
+ Series.isin
:df_result = df1.loc[~df1['id'].isin(df2.loc[df2['total'] > max_value, 'id'])]
Output:
id name
1 2 name2
(change >
to >=
if you meant that. Your Q seems ambiguous.)
Data used
import pandas as pd
data = {'id': {0: 1, 1: 2, 2: 3, 3: 4},
'name': {0: 'name1', 1: 'name2', 2: 'name3', 3: 'name4'}}
df1 = pd.DataFrame(data)
data2 = {'id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 4, 8: 4},
'total': {0: 10, 1: 24, 2: 33, 3: 14, 4: 21, 5: 30, 6: 1, 7: 29, 8: 31}}
df2 = pd.DataFrame(data2)