Search code examples
pythonpandasdataframeouter-join

Pandas outer join with no duplicates adds new rows


I have 2 dataframes preds and assets_to_remove.

This is how the dataframe preds looks:

               asset_id        asset_name
294771  493646671302244         queue_bar
294770  503848157271852   refactor_target
294769  786314528522899   submission_tray
294768  206472013793428  state_subscriber
294767  510707746509671         format_gk
...                 ...               ...
4       688122571800214          v2_reads
3       323798285353466 products_v2_reads
2       395943214896870   update_protocol
1       680526449474908          fix_v153
0       349458202857963   adjustment_v159

[294772 rows x 2 columns]

This is how the dataframe assets_to_remove looks:

            asset_id
0    513454469563578
1    829695400866900
2    764696234441014
3    195100021778259
4    368797654574209
..               ...
20   237207674121701
21   135774837852816
22  2453638234940010
23   705229516884471
24   343619773239104

[1995 rows x 1 columns]

Neither of these 2 dataframes have a row with asset_id equal to 57412518735315968. Checking preds:

print(preds[preds.asset_id.eq(57412518735315968)])
Empty DataFrame
Columns: [asset_id, asset_name]
Index: []

Checking assets_to_remove:

print(assets_to_remove[assets_to_remove.asset_id.eq(57412518735315968)])
Empty DataFrame
Columns: [asset_id]
Index: []

Now I do an outer join on these 2 dataframes:

z = pd.merge(preds,assets_to_remove,on="asset_id",how="outer",indicator="source").astype({"asset_id": "int64"})

Gives a result dataframe like this:


                asset_id  ...      source
0        493646671302244  ...   left_only
1        503848157271852  ...   left_only
2        786314528522899  ...   left_only
3        206472013793428  ...   left_only
4        510707746509671  ...   left_only
...                  ...  ...         ...
296016   743251236547292  ...  right_only
296017   890822734697339  ...  right_only
296018   274927503757939  ...  right_only
296019   943962539702954  ...  right_only
296020  2453638234940010  ...  right_only

[296021 rows x 3 columns]

This dataframe that has a row with asset id 57412518735315968!

print(z[z.asset_id.eq(57412518735315968)])
                 asset_id                             asset_name     source
216128  57412518735315968  storefront_ig_new_menu_items_internal  left_only

How is this possible? Neither of the 2 dataframes have this value. I also made sure there are no duplicate rows in both the dataframes. Can someone please shed some light on this?


Solution

  • Check datatype on asset_id on dataframes before the merge, is it int64 in both cases?

    This issue could be possible as before the merge you are comparing to numeric value 57412518735315968, if type in original dataframes is not int64, and instead is object, then your equality check will not be returning a matching row.

    In the merge step you are explicitly changing the data type of asset_id to int64, equality check will pass in this case.