Search code examples
pythonpandasdataframesubtraction

Pandas : Cannot subtract dataframes


I have 2 simple data frames

df1

                                             colA    colB
0   e5b1b9fc-ade9-4501-a66b-ef2ecd57483e.d9967f258...  2ZWR52QYZ86H
1   8d127d82-cfa4-421f-9081-cf35132b8248.f0865b3b9...  61RPLMR5BFFT
2   005c8e84-98b4-402d-a24e-6a63edad0598.16b6f0f9f...  7L256IQTB1M1
3   d87f6dfd-1c55-4ce5-9b84-e80b6aa958d8.3f0901c7f...  3H9SLNATBJ01
4   cf89c9dd-004e-40e7-8120-3397ce5fd97e.f571bc175...  4Z8RT5VZNOQ8
5   9eebc606-e8d0-40e3-9ba5-6d3e1b77bc64.0dc42d528...  1DEOAHZL2JFC
6   7112aef1-5fa0-4459-aa1b-15cba2f96ec5.6a9ecb28d...  2CIISYGAAV69
7   e30d901c-34e6-4974-9b9e-1fe206ed6fca.701f1358e...  2NLLJ70RXKW2
8   13677989-8979-4422-a471-7fda22ea4f6d.e00051e45...  6P60G721DVHK

df2

0   e5b1b9fc-ade9-4501-a66b-ef2ecd57483e.d9967f258...  2ZWR52QYZ86H
1   8d127d82-cfa4-421f-9081-cf35132b8248.f0865b3b9...  61RPLMR5BFFT
2   005c8e84-98b4-402d-a24e-6a63edad0598.16b6f0f9f...  7L256IQTB1M1
3   d87f6dfd-1c55-4ce5-9b84-e80b6aa958d8.3f0901c7f...  3H9SLNATBJ01
4   cf89c9dd-004e-40e7-8120-3397ce5fd97e.f571bc175...  4Z8RT5VZNOQ8

now, i want to isolate the rows in df1, that are not in df2

so i tried

df1.subtract(df2)

but i get

result[mask] = op(xrav[mask], yrav[mask])
TypeError: unsupported operand type(s) for -: 'str' and 'str'

What am i doing wrong?


Solution

  • You can use merge to get the rows in df1 that are not in df2.

    Using indicator parameter to check which rows are found in both data frames, only in the left data frame, or only in the right data frame.

    merged_df = df1.merge(df2, on=['colA', 'colB'], how='outer', indicator=True)
    result_df = merged_df.loc[merged_df['_merge'] == 'left_only']
    

    Example output:

    colA colB _merge
    5 9eebc606-e8d0-40e3-9ba5-6d3e1b77bc64.0dc42d528 1DEOAHZL2JFC left_only
    6 7112aef1-5fa0-4459-aa1b-15cba2f96ec5.6a9ecb28d 2CIISYGAAV69 left_only
    7 e30d901c-34e6-4974-9b9e-1fe206ed6fca.701f1358e 2NLLJ70RXKW2 left_only
    8 13677989-8979-4422-a471-7fda22ea4f6d.e00051e45 6P60G721DVHK left_only

    About your comment:

    this worked for me concat_df = pd.concat([df1,df2]).drop_duplicates(keep=False, inplace=False)

    That option is only valid if df2 doesn't have rows that aren't in df1 (All rows in df2 are in df1). If both df1 and df2 have unique rows, using pd.concat() and dropping duplicates won't only isolate the rows in df1 that are not in df2 as you requested in your question, you would have all rows unique in df2 too