I am attempting to merge two DataFrames via their matching keys, but the keys may occur multiple (n) times in each DataFrame. The inner join gives rows for all n^2 pairings of keys - instead, I want n rows.
For some context: imagine a check-in/check-out scenario for a library, where a book may be checked in (recorded in df1) or checked out (recorded in df2). Each book has a unique key, but may be checked in/out multiple times. Also, since the datasets only span a certain time window, some books may have a check-in record but no check-out (books that had been checked out before the data was recorded), or a check-out record but no check-in (books that have yet to be returned). My goal is to create a new DataFrame that only includes rows where there is both a check-in and a corresponding check-out.
So, to finally get to my questions:
1) How can I perform an inner join, where the first check-in is combined with first check-out, second check-in with second check-out, etc? By default, it gives all combinations - so if there are n check-ins and n check-outs, I get n^2 rows instead of the n rows that I want. (For now, let's ignore the possibility that a check-out precedes the first check-in, or an unequal number of check-ins/check-outs.) A simple example is below.
df1 = pd.DataFrame({'ID': ['A1', 'A2', 'A2', 'A3'], 'DATE': [1, 1, 2, 2]})
df2 = pd.DataFrame({'ID': ['A2', 'A3', 'A2', 'A4'], 'DATE': [3, 5, 5, 7]})
df = pd.merge(df1, df2, how='inner', on='ID', sort=True)
Note that I get 4 rows for A2 and 1 for A3, whereas for A2 I only want the (DATE_x=1, DATE_y=3) and (DATE_x=2, DATE_y=5) pairings, along with the single A3 row.
df_wanted = pd.DataFrame({'ID': ['A2', 'A2', 'A3'], 'DATE_x': [1, 2, 2], 'DATE_y': [3, 5, 5]})
2) The full case. Each check-in and check-out should be paired up, with each check-out matched to a check-in that precedes it. So if there are check-ins on days 1 and 2, and check-outs on days 0, 3 and 5, the day-0 check-out has no preceding check-in in the data; the only rows in the final df should correspond to the two check-ins (days 1, 2) and the last two check-outs (days 3, 5).
df1 = pd.DataFrame({'ID': ['A1', 'A2', 'A2', 'A3'], 'DATE': [1, 1, 2, 2]})
df2 = pd.DataFrame({'ID': ['A2', 'A3', 'A2', 'A4', 'A2'], 'DATE': [3, 5, 5, 7, 0]})
df = pd.merge(df1, df2, how='inner', on='ID', sort=True)
I added a check-out at day 0 to df2. Now df has 6 entries for A2 (plus the one for A3), whereas I only want 2 entries for A2 (plus the one for A3). This should result in the same df_wanted as above.
NOTE: if the check-outs are ordered by date, the posted answer will match the check-ins from days 1, 2 to the check-outs on days 0, 3 rather than the check-outs on days 3, 5. So a full solution will need to ensure that each check-out date is >= its paired check-in date, or start the counter at the earliest check-in, or something similar.
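For the record, here is a sketch of one way to handle the full case, along the same counter-based lines as the answer below but with an extra filtering step (the variable names are mine, and this assumes check-ins and check-outs alternate in date order once orphan check-outs are removed):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['A1', 'A2', 'A2', 'A3'], 'DATE': [1, 1, 2, 2]})
df2 = pd.DataFrame({'ID': ['A2', 'A3', 'A2', 'A4', 'A2'], 'DATE': [3, 5, 5, 7, 0]})

# Drop check-outs that precede the ID's earliest check-in: they belong
# to a check-in outside the data window. IDs missing from df1 map to
# NaN and are dropped here too (the inner merge would drop them anyway).
first_checkin = df1.groupby('ID')['DATE'].min()
df2 = df2[df2['DATE'] >= df2['ID'].map(first_checkin)]

# Sort by date, then number each ID's records so that the k-th
# check-in lines up with the k-th remaining check-out.
df1 = df1.sort_values('DATE')
df2 = df2.sort_values('DATE')
df1['counter'] = df1.groupby('ID').cumcount()
df2['counter'] = df2.groupby('ID').cumcount()

df = pd.merge(df1, df2, on=['ID', 'counter'], suffixes=('_checkin', '_checkout'))
df = df.drop(columns='counter')
print(df)
```

On the example above this yields the two A2 pairings (1, 3) and (2, 5) plus the single A3 row; note it does not validate that every check-out actually follows its own paired check-in.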
What I have tried:
I tried various combinations of df.drop_duplicates(), but I end up with the incorrect combinations. I also tried doing this manually by looping through all IDs common to both datasets (common = set(df1.ID.values) & set(df2.ID.values)), pairing them up by their order of occurrence, and then adding them to a new df one by one, but this seems rather inefficient.
This seems like a common enough task that there might be a more "pythonic" way of handling it?
Thanks for taking the time, and I appreciate any guidance or tips with this.
Here is my solution:
import pandas as pd

df1 = pd.DataFrame({'date': [1, 1, 2, 2], 'id': ['A1', 'A2', 'A2', 'A3']})
df2 = pd.DataFrame({'date': [3, 5, 5, 7], 'id': ['A2', 'A3', 'A2', 'A4']})

# Keep only ids that appear in both frames
df1 = df1[df1.id.isin(df2.id)]
df2 = df2[df2.id.isin(df1.id)]

# Number each occurrence of an id (1, 2, ...) so that the k-th
# check-in can later be matched with the k-th check-out
df1['ones'] = 1
df1['counter'] = df1.groupby('id')['ones'].cumsum()
del df1['ones']
df2['ones'] = 1
df2['counter'] = df2.groupby('id')['ones'].cumsum()
del df2['ones']

# Inner join on both id and the per-id counter
df3 = pd.merge(df1, df2, on=['id', 'counter'], suffixes=('_checkin', '_checkout'))
del df3['counter']
print(df3)
date_checkin id date_checkout
0 1 A2 3
1 2 A2 5
2 2 A3 5
Steps:
Initialize dataframes:
import pandas as pd
df1 = pd.DataFrame({'date':[1, 1, 2, 2], 'id':['A1', 'A2', 'A2', 'A3']})
df2 = pd.DataFrame({'date':[3, 5, 5, 7], 'id':['A2', 'A3', 'A2', 'A4']})
Filter each DataFrame down to the ids that also appear in the other:
df1 = df1[df1.id.isin(df2.id)]
df2 = df2[df2.id.isin(df1.id)]
Create a cumulative-sum counter to match on. This is what lets us match the first check-in with the first check-out, the second with the second, and so on:
df1['ones'] = 1
df1['counter'] = df1.groupby('id')['ones'].cumsum()
del df1['ones']
df2['ones'] = 1
df2['counter'] = df2.groupby('id')['ones'].cumsum()
del df2['ones']
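As a side note (my variation, not part of the original answer), the same counter can be built without the helper column using groupby().cumcount(), a standard pandas method that numbers the rows within each group starting at 0 rather than 1; since both frames use the same numbering scheme, the join is unaffected:

```python
import pandas as pd

# The frames as they look after the filtering step above
df1 = pd.DataFrame({'date': [1, 2, 2], 'id': ['A2', 'A2', 'A3']})
df2 = pd.DataFrame({'date': [3, 5, 5], 'id': ['A2', 'A3', 'A2']})

# cumcount() numbers the rows within each id group: 0, 1, 2, ...
df1['counter'] = df1.groupby('id').cumcount()
df2['counter'] = df2.groupby('id').cumcount()
```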
Now we can do an inner join on both id and counter:
df3 = pd.merge(df1, df2, on=['id', 'counter'], suffixes = ['_checkin', '_checkout'])
del df3['counter']
print(df3)
date_checkin id date_checkout
0 1 A2 3
1 2 A2 5
2 2 A3 5
I hope this helps!