Search code examples
pythonpandasdataframejoin

Pandas EXCLUSIVE LEFT OUTER JOIN with line count


I am creating a transactions import tool that updates a DB with new transactions every day.

I am getting an Excel file (that I am opening using pandas) with the entire month transactions and I am trying to filter only the new transactions by merging the new DataFrame with the existing one.

For this I am using pandas merge to do EXCULSIVE LEFT OUTER JOIN but I have a problem with multiple rows with the same exact values.

See this example:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([[pd.Timestamp('2023-1-1'), 'A', 10]
                            , [pd.Timestamp('2023-1-1'), 'A', 10]
                            , [pd.Timestamp('2023-1-1'), 'B', 11]
                            , [pd.Timestamp('2023-1-2'), 'C', 12]
                            , [pd.Timestamp('2023-1-2'), 'D', 13]
                            , [pd.Timestamp('2023-1-2'), 'E', 14]
                            , [pd.Timestamp('2023-1-3'), 'F', 15]]),
                   columns=['Date', 'Title', 'Amount'])

df2 = pd.DataFrame(np.array([[pd.Timestamp('2023-1-1'), 'A', 10]
                            , [pd.Timestamp('2023-1-1'), 'B', 11]
                            , [pd.Timestamp('2023-1-2'), 'C', 12]]),
                   columns=['Date', 'Title', 'Amount'])

df3 = pd.merge(df1, df2, on=['Date', 'Title', 'Amount'], how="outer", indicator=True)
df3 = df3[df3['_merge'] == 'left_only']

print(df1)
print(df2)
print(df3) # Both 'A' rows deleted while one 'A' row is new and should be in df3

The output is:

       Date Title Amount
0 2023-01-01     A     10
1 2023-01-01     A     10
2 2023-01-01     B     11
3 2023-01-02     C     12
4 2023-01-02     D     13
5 2023-01-02     E     14
6 2023-01-03     F     15
        Date Title Amount
0 2023-01-01     A     10
1 2023-01-01     B     11
2 2023-01-02     C     12
        Date Title Amount     _merge
4 2023-01-02     D     13  left_only
5 2023-01-02     E     14  left_only
6 2023-01-03     F     15  left_only

With the above method, both 'A' rows are deleted while one 'A' row is new and thus should be in the new DataFrame.

Any ideas on what operation can be used to keep only the rows that in the first Dataframe with consideration to rows count? To give a little more information, transactions in the same day are not ordered (no time information, only date) and new transactions can be added multiple days in the past.


Solution

  • I think you most likely want this :

    df3 = (
        pd.merge(df1, df2,
                 left_on = [*list(df1), df1.groupby(list(df1)).cumcount()],
                 right_on= [*list(df2), df2.groupby(list(df2)).cumcount()],
                 how="outer", indicator=True)
            .query("_merge == 'left_only'")
            .drop(columns="key_3")
    )
    

    Output :

    print(df3)
    
            Date Title Amount     _merge
    1 2023-01-01     A     10  left_only
    4 2023-01-02     D     13  left_only
    5 2023-01-02     E     14  left_only
    6 2023-01-03     F     15  left_only