I have df1:
trip_id latitude longitude datetime
0 356a192b7913b04c54574d18c28d46e6395428ab 57.723610 11.925191 2021-06-13 14:22:11.682
1 356a192b7913b04c54574d18c28d46e6395428ab 57.723614 11.925187 2021-06-13 14:22:13.562
2 356a192b7913b04c54574d18c28d46e6395428ab 57.723610 11.925172 2021-06-13 14:22:28.635
3 da4b9237bacccdf19c0760cab7aec4a8359010b0 57.723637 11.925056 2021-06-13 14:22:59.336
4 da4b9237bacccdf19c0760cab7aec4a8359010b0 57.724075 11.923708 2021-06-13 14:23:44.905
5 77de68daecd823babbb58edb1c8e14d7106e83bb 57.723610 11.925191 2021-06-13 14:22:04.000
6 77de68daecd823babbb58edb1c8e14d7106e83bb 57.723614 11.925178 2021-06-13 14:22:44.170
7 77de68daecd823babbb58edb1c8e14d7106e83bb 57.723827 11.924635 2021-06-13 14:23:14.479
8 77de68daecd823babbb58edb1c8e14d7106e83bb 57.723866 11.924005 2021-06-13 14:23:29.605
and df2:
trip_id latitude longitude datetime
0 356a192b7913b04c54574d18c28d46e6395428ab 47.723610 14.925187 2021-06-13 14:22:59.336
1 555a192b7913b04c54574d18c28d46e639542yyy 60.723610 12.925187 2021-06-13 14:22:59.336
2 77de68daecd823babbb58edb1c8e14d7106e83cc 58.993066 13.924005 2022-06-13 20:23:29.605
I want to check if the rows in df2 exist in df1. For this I tried:
df2.isin(df1)
>>
trip_id latitude longitude datetime
0 True False False False
1 False False False False
2 False False False False
The end goal is to add the observations (rows) from df2 to df1 if they don't exist. However, as True is returned for the index 0 in trip_id not the entire first row of df2 would be identified as new in df1. As long as there is one element different in the observations, it should be added as a new row in df1.
To add the new rows in df2 to df1, I would use:
df3 = df1.append(df2, ignore_index=True)
So ultimately, the final code should be:
if df2.isin(df1):
df3 = df1.append(df2, ignore_index=True)
Your example data is a bit confusing, you want to reject duplicates in df2
, but df1
already has duplicates (by trip_id
); e.g. the first 3 rows of df1
have the same trip_id
. So the goal is a bit unclear.
Also, the construct you expect:
if df2.isin(df1):
df3 = df1.append(df2, ignore_index=True)
is wrong. As your example shows pd.DataFrame.isin
will return a dataframe of booleans, and the truth value of a pandas/numpy array isn't defined. You will get this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
So going by this description:
As long as there is one element different in the observations, it should be added as a new row in df1.
I suggest the following:
df1.append(df2[~df1.isin(df2).all(axis=1)])
However, if you are doing this many times, I would suggest refactoring your code into doing something like this:
updates = [df2[...], df3[...], ..., dfn[...]]
result = pd.concat([df1, *updates]], axis=0)