Search code examples
pythonpandasmergefuzzy-comparison

Merge two dataframes based on fuzzy-matches in two columns


I have 2 dataframes that I am trying to merge based on IDs and a secondary ID. Here are a sample of the two dataframes:

First ID                  Second ID               Company
10056526008010            0.000000e+00            Company A
10022337820851            8.152050e+11            Company B
722337820853              8.152050e+11            Company C
10056526008010            0.000000e+00            Company E
10022337820851            6.290250e+11            Company D
First ID                  Second ID               Company               Availability
878968000512              0.000000e+00            Company F             Y
10022337820851            8.152050e+11            Company B             Y
10022337820851            8.152050e+11            Company B             Y
722337820853              8.152050e+11            Company C             N
10056526008010            0.000000e+00            Company E             N
10056526008010            0.000000e+00            Company G             N
10022337820851            6.290250e+11            Company D             Y

I want to be able to merge based on first a match with the First ID then check to see if the Secondary ID matches, if it does then it should be merged, if it doesn't then it shouldn't be (i suspect an inner merge would be needed). If the Secondary ID is 0, then a fuzzy match of the Company name should be done. Is there any way to do this? The desired output would look like this:

First ID                  Second ID               Company           Availability
10022337820851            8.152050e+11            Company B         Y
722337820853              8.152050e+11            Company C         N    
10056526008010            0.000000e+00            Company E         N
10056526008010            0.000000e+00            Company G         N
10022337820851            6.290250e+11            Company D         Y

This way the 2 dataframes are first matched based on the First ID, then to check for duplicates the Second ID is used and if the Second ID is 0, a similar match based on Company is done. If no match comes up for Second ID or Company, then no mergge is done for that row.


Solution

  • Let the dataframes be df1 and df2 then you can use merge on the ids first and then filter for the case of the secondary id being 0:

    merged = df1.merge(df2, on=['First ID', 'Second ID'], suffixes=('_left', '_right'))
    mask = fuzzy_match(merged['Company_left'], merged['Company_right']) | merged['Second ID_left'] != 0
    merged[mask]
    

    Here fuzzy_match is a function that takes two series and returns a boolean series or array.