
Python3 Pandas - how to merge two DataFrames that contain multiple rows with the same key


I am attempting to merge two DataFrames via their matching keys, but the keys may occur multiple (n) times in each DataFrame. The inner join gives rows for all n^2 pairings of keys - instead, I want n rows.

For some context: Imagine a check-in/check-out scenario for a library, where a book may be checked in, which is recorded in df1, or checked out, and recorded in df2. Each book has a unique key, but may be checked-in/out multiple times. Also, as the datasets only span a certain time window, some books may have a check-in record but not a check-out (books that had been checked out before the data was recorded), or a check-out record but not a check-in (books that have yet to be returned). My goal is to create a new DataFrame that only includes rows where there is both a check-in and corresponding check-out.

So, to finally get to my questions:

1) How can I perform an inner join, where the first check-in is combined with first check-out, second check-in with second check-out, etc? By default, it gives all combinations - so if there are n check-ins and n check-outs, I get n^2 rows instead of the n rows that I want. (For now, let's ignore the possibility that a check-out precedes the first check-in, or an unequal number of check-ins/check-outs.) A simple example is below.

df1 = pd.DataFrame({'ID': ['A1', 'A2', 'A2', 'A3'], 'DATE': [1, 1, 2, 2]})
df2 = pd.DataFrame({'ID': ['A2', 'A3', 'A2', 'A4'], 'DATE': [3, 5, 5, 7]})
df = pd.merge(df1, df2, how='inner', on='ID', sort=True)


Note that I get 4 entries for A2 and 1 for A3, whereas I only want rows 0 and 2 for A2 along with row 4 for A3.

df_wanted = pd.DataFrame({'ID': ['A2', 'A2', 'A3'], 'DATE_x': [1, 2, 2], 'DATE_y': [3, 5, 5]})

2) The full case. Each check-out should be paired with the nearest check-in that precedes it. So if there are check-ins on days 1 and 2, and check-outs on days 0, 3 and 5, the only rows in the final df should pair the two check-ins (days 1 and 2) with the last two check-outs (days 3 and 5).

df1 = pd.DataFrame({'ID': ['A1', 'A2', 'A2', 'A3'], 'DATE': [1, 1, 2, 2]})
df2 = pd.DataFrame({'ID': ['A2', 'A3', 'A2', 'A4', 'A2'], 'DATE': [3, 5, 5, 7, 0]})
df = pd.merge(df1, df2, how='inner', on='ID', sort=True)

I added a check-out at day 0 to df2. Now in df, I get 6 entries for A2 (and the one for A3), whereas I only want 2 entries for A2 (and the one for A3). This should result in the same df_wanted as above.

NOTE: the posted answer will match the check-ins on days 1 and 2 to the check-outs on days 0 and 3, rather than the check-outs on days 3 and 5. So a full solution will need to ensure that the check-out date is >= the check-in date, or start the counter at the earliest check-in, or something similar.
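
One way to implement that note (my sketch, not part of the original post): per ID, drop any check-out dated before that ID's earliest check-in, then pair the remaining events by order of occurrence with `groupby().cumcount()`. This assumes check-ins and check-outs alternate once the records begin; any unmatched trailing events are dropped by the inner join.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['A1', 'A2', 'A2', 'A3'], 'DATE': [1, 1, 2, 2]})
df2 = pd.DataFrame({'ID': ['A2', 'A3', 'A2', 'A4', 'A2'], 'DATE': [3, 5, 5, 7, 0]})

# Keep only IDs present in both frames, sorted by date within each ID
checkins = df1[df1['ID'].isin(df2['ID'])].sort_values(['ID', 'DATE']).copy()
checkouts = df2[df2['ID'].isin(df1['ID'])].sort_values(['ID', 'DATE']).copy()

# Drop any check-out dated before its ID's earliest check-in
first_in = checkins.groupby('ID')['DATE'].min()
checkouts = checkouts[checkouts['DATE'] >= checkouts['ID'].map(first_in)]

# Number events per ID and pair them one-to-one
checkins['n'] = checkins.groupby('ID').cumcount()
checkouts['n'] = checkouts.groupby('ID').cumcount()

paired = pd.merge(checkins, checkouts, on=['ID', 'n'],
                  suffixes=('_in', '_out')).drop(columns='n')
print(paired)
```

On the example above this reproduces the same rows as df_wanted (check-ins on days 1, 2, 2 paired with check-outs on days 3, 5, 5).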

What I have tried: I tried various combinations of df.drop_duplicates(), but I end up with the incorrect combinations. I also tried doing this manually by looping through all IDs common to both datasets (common = set(df1.ID.values) & set(df2.ID.values)), pairing them up by their order of occurrence, and then adding them to a new df one by one, but this seems rather inefficient.

This seems like a common enough task that there might be a more "pythonic" way of handling this?

Thanks for taking the time, and I appreciate any guidance or tips with this.


Solution

  • Here is my solution:

    import pandas as pd
    df1 = pd.DataFrame({'date':[1, 1, 2, 2], 'id':['A1', 'A2', 'A2', 'A3']})
    df2 = pd.DataFrame({'date':[3, 5, 5, 7], 'id':['A2', 'A3', 'A2', 'A4']})
    
    df1 = df1[df1.id.isin(df2.id)].copy()  # .copy() avoids SettingWithCopyWarning below
    df2 = df2[df2.id.isin(df1.id)].copy()
    
    df1['ones'] = 1
    df1['counter'] = df1.groupby('id')['ones'].cumsum()
    del df1['ones']
    
    df2['ones'] = 1
    df2['counter'] = df2.groupby('id')['ones'].cumsum()
    del df2['ones']
    
    
    df3 = pd.merge(df1, df2, on=['id', 'counter'], suffixes=['_checkin', '_checkout'])
    del df3['counter']
    
    print(df3)
    
       date_checkin  id  date_checkout
    0             1  A2              3
    1             2  A2              5
    2             2  A3              5
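
    As an aside (my addition, not part of the original answer): since the approach relies on each (id, counter) pair being unique on both sides, merge's validate='one_to_one' option can make that assumption explicit and fail loudly if it is ever violated:

```python
import pandas as pd

df1 = pd.DataFrame({'date': [1, 1, 2, 2], 'id': ['A1', 'A2', 'A2', 'A3']})
df2 = pd.DataFrame({'date': [3, 5, 5, 7], 'id': ['A2', 'A3', 'A2', 'A4']})

df1 = df1[df1.id.isin(df2.id)].copy()
df2 = df2[df2.id.isin(df1.id)].copy()
df1['counter'] = df1.groupby('id').cumcount()
df2['counter'] = df2.groupby('id').cumcount()

# validate='one_to_one' raises a MergeError if any (id, counter)
# key appears more than once on either side
df3 = pd.merge(df1, df2, on=['id', 'counter'],
               suffixes=['_checkin', '_checkout'],
               validate='one_to_one').drop(columns='counter')
```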
    

    Steps:

    Initialize dataframes:

    import pandas as pd
    df1 = pd.DataFrame({'date':[1, 1, 2, 2], 'id':['A1', 'A2', 'A2', 'A3']})
    df2 = pd.DataFrame({'date':[3, 5, 5, 7], 'id':['A2', 'A3', 'A2', 'A4']})
    

    Filter each DataFrame down to only the ids that exist in the other:

    df1 = df1[df1.id.isin(df2.id)].copy()  # .copy() avoids SettingWithCopyWarning below
    df2 = df2[df2.id.isin(df1.id)].copy()
    

    Create a cumulative-sum counter to match on. This is what lets us pair the first check-in with the first check-out, the second with the second, and so on:

    df1['ones'] = 1
    df1['counter'] = df1.groupby('id')['ones'].cumsum()
    del df1['ones']
    
    df2['ones'] = 1
    df2['counter'] = df2.groupby('id')['ones'].cumsum()
    del df2['ones']
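
    (A side note, my addition: groupby().cumcount() builds the same per-id counter in one step, without the temporary ones column.)

```python
import pandas as pd

df1 = pd.DataFrame({'date': [1, 1, 2, 2], 'id': ['A1', 'A2', 'A2', 'A3']})

# cumcount() numbers rows within each group starting at 0;
# adding 1 matches the 1-based cumsum counter above
df1['counter'] = df1.groupby('id').cumcount() + 1
```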
    

    Now we can do an inner join on both id and counter:

    df3 = pd.merge(df1, df2, on=['id', 'counter'], suffixes=['_checkin', '_checkout'])
    del df3['counter']
    
    print(df3)
    
       date_checkin  id  date_checkout
    0             1  A2              3
    1             2  A2              5
    2             2  A3              5
    

    I hope this helps!