Search code examples
pythonpandasdataframeappendisin

Pandas: Issue using isin() to check if data frame observations exist in other dataframe


I have df1:

    trip_id                                     latitude    longitude   datetime
0   356a192b7913b04c54574d18c28d46e6395428ab    57.723610   11.925191   2021-06-13 14:22:11.682
1   356a192b7913b04c54574d18c28d46e6395428ab    57.723614   11.925187   2021-06-13 14:22:13.562
2   356a192b7913b04c54574d18c28d46e6395428ab    57.723610   11.925172   2021-06-13 14:22:28.635
3   da4b9237bacccdf19c0760cab7aec4a8359010b0    57.723637   11.925056   2021-06-13 14:22:59.336
4   da4b9237bacccdf19c0760cab7aec4a8359010b0    57.724075   11.923708   2021-06-13 14:23:44.905
5   77de68daecd823babbb58edb1c8e14d7106e83bb    57.723610   11.925191   2021-06-13 14:22:04.000
6   77de68daecd823babbb58edb1c8e14d7106e83bb    57.723614   11.925178   2021-06-13 14:22:44.170
7   77de68daecd823babbb58edb1c8e14d7106e83bb    57.723827   11.924635   2021-06-13 14:23:14.479
8   77de68daecd823babbb58edb1c8e14d7106e83bb    57.723866   11.924005   2021-06-13 14:23:29.605

and df2:

    trip_id                                     latitude    longitude   datetime
0   356a192b7913b04c54574d18c28d46e6395428ab    47.723610   14.925187   2021-06-13 14:22:59.336
1   555a192b7913b04c54574d18c28d46e639542yyy    60.723610   12.925187   2021-06-13 14:22:59.336
2   77de68daecd823babbb58edb1c8e14d7106e83cc    58.993066   13.924005   2022-06-13 20:23:29.605


I want to check if the rows in df2 exist in df1. For this I tried:

df2.isin(df1)

>>

 trip_id  latitude  longitude   datetime
0   True    False   False   False
1   False   False   False   False
2   False   False   False   False


The end goal is to add the observations (rows) from df2 to df1 if they don't exist. However, as True is returned for the index 0 in trip_id not the entire first row of df2 would be identified as new in df1. As long as there is one element different in the observations, it should be added as a new row in df1.

To add the new rows in df2 to df1, I would use:

df3 = df1.append(df2, ignore_index=True)

So ultimately, the final code should be:

if df2.isin(df1):
   df3 = df1.append(df2, ignore_index=True)


Solution

  • Your example data is a bit confusing, you want to reject duplicates in df2, but df1 already has duplicates (by trip_id); e.g. the first 3 rows of df1 have the same trip_id. So the goal is a bit unclear.

    Also, the construct you expect:

    if df2.isin(df1):
       df3 = df1.append(df2, ignore_index=True)
    

    is wrong. As your example shows pd.DataFrame.isin will return a dataframe of booleans, and the truth value of a pandas/numpy array isn't defined. You will get this error:

    ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
    

    So going by this description:

    As long as there is one element different in the observations, it should be added as a new row in df1.

    I suggest the following:

    df1.append(df2[~df1.isin(df2).all(axis=1)])
    

    However, if you are doing this many times, I would suggest refactoring your code into doing something like this:

    updates = [df2[...], df3[...], ..., dfn[...]]
    result = pd.concat([df1, *updates]], axis=0)