I have 2 dataframes that I am trying to merge based on IDs and a secondary ID. Here are a sample of the two dataframes:
First ID Second ID Company
10056526008010 0.000000e+00 Company A
10022337820851 8.152050e+11 Company B
722337820853 8.152050e+11 Company C
10056526008010 0.000000e+00 Company E
10022337820851 6.290250e+11 Company D
First ID Second ID Company Availability
878968000512 0.000000e+00 Company F Y
10022337820851 8.152050e+11 Company B Y
10022337820851 8.152050e+11 Company B Y
722337820853 8.152050e+11 Company C N
10056526008010 0.000000e+00 Company E N
10056526008010 0.000000e+00 Company G N
10022337820851 6.290250e+11 Company D Y
I want to be able to merge based on first a match with the First ID then check to see if the Secondary ID matches, if it does then it should be merged, if it doesn't then it shouldn't be (i suspect an inner merge would be needed). If the Secondary ID is 0, then a fuzzy match of the Company name should be done. Is there any way to do this? The desired output would look like this:
First ID Second ID Company Availability
10022337820851 8.152050e+11 Company B Y
722337820853 8.152050e+11 Company C N
10056526008010 0.000000e+00 Company E N
10056526008010 0.000000e+00 Company G N
10022337820851 6.290250e+11 Company D Y
This way the 2 dataframes are first matched based on the First ID, then to check for duplicates the Second ID is used and if the Second ID is 0, a similar match based on Company is done. If no match comes up for Second ID or Company, then no mergge is done for that row.
Let the dataframes be df1
and df2
then you can use merge
on the ids first and then filter for the case of the secondary id being 0:
merged = df1.merge(df2, on=['First ID', 'Second ID'], suffixes=('_left', '_right'))
mask = fuzzy_match(merged['Company_left'], merged['Company_right']) | merged['Second ID_left'] != 0
merged[mask]
Here fuzzy_match
is a function that takes two series and returns a boolean series or array.