Search code examples
python-3.xpandasdifflib

Merge two Dataframes on two columns with different length by closest match


I want to merge these example dataframes:

  1. How to get the closest matches in a new df?
df1:

name           age      department
DJ Griffin     27       FD
Harris Smith   33       RD


df2:

name               age      department
D.J. Griffin III   27       FD
Harris Smith       33       RD
Miles Jones        58       RD

The result should look like:

df3:

name         age      department   name_y
DJ Griffin     27       FD         D.J. Griffin III
Harris Smith   33       RD         Harris Smith

Used Difflib but got an error, cause of the different length of the dfs.

import pandas as pd
import difflib

df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"]], columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Miles Jones", 58, "RD"]], columns=["name", "age", "department"])

df2['name_y'] = df2['name']

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])

Result:

IndexError: list index out of range
  1. How to find the closest match by two columns when there is another Harris Smith with the age of 45?
For the duplicate Harris Smith case

df1:

name           age      department
DJ Griffin     27       FD
Harris Smith   33       RD
Harris Smith   45       BA

df2:

name               age      department
D.J. Griffin III   27       FD
Harris Smith       33       RD
Harris Smith       45       BA
Miles Jones        58       RD

The result should look like:

df3:

name         age      department   name_y
DJ Griffin     27       FD         D.J. Griffin III
Harris Smith   33       RD         Harris Smith
Harris Smith   45       BA         Harris Smith
import pandas as pd
import difflib

df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"]], columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"], ["Miles Jones", 58, "RD"]], columns=["name", "age", "department"])

df2['name_y'] = df2['name']

Thanks for your help.


Solution

  • The issue arises when you have zero match, slicing [0] is not possible.

    You could use instead:

    df2['name'].apply(lambda x: next(iter(difflib.get_close_matches(x, df1['name'])), pd.NA))
    

    or

    df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])).str[0]
    

    output:

    0      DJ Griffin
    1    Harris Smith
    2             NaN
    Name: name, dtype: object
    

    update:

    df1.merge(df2[['name', 'age']]
     .assign(name_y=df2['name'],
             name=df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])))
     .explode('name')
     .drop_duplicates(),
     on=['name', 'age']
    )
    

    output:

               name  age department            name_y
    0    DJ Griffin   27         FD  D.J. Griffin III
    1  Harris Smith   33         RD      Harris Smith
    2  Harris Smith   45         BA      Harris Smith