Search code examples
pythonpandasdifflib

Python pandas, substitute all elements in a dataframe column with closest string match from known list


I have a dataframe where the values in a column are strings which are often mispelled, something like:

col01 | col02 | name
-----------------------
 ---  |  ---  |  mrk 
 ---  |  ---  |  anth3n7 
 ---  |  ---  |  j4ck 

and a list of possible correct values for this column

possible_names = ['mark', 'anthony', 'jack']

I wanted to go over the elements of the 'name' column, and get the closest match from the list of correct names, so that I would end up with:

col01 | col02 | name      |   correct_name
-----------------------------------------
 ---  |  ---  |  mrk      |       mark
 ---  |  ---  |  anth3n7  |      anthony
 ---  |  ---  |  j4ck     |       jack

I tried using this code:

df_names = pd.read_csv('names/wrong.csv')
possible_names = ['mark', 'anthony', 'jack']
df_names['correct_name'] = difflib.get_close_matches(df_names['name'], possible_names)

but I get the error:

ValueError: Length of values (0) does not match length of index (3)

Solution

  • Use:

    import difflib
    
    import pandas as pd
    
    df = pd.DataFrame(data=["mrk", "anth3n7", "j4ck"], columns=["name"])
    possible_names = ['mark', 'anthony', 'jack']
    
    df["correct_name"] = df["name"].apply(lambda x: difflib.get_close_matches(x, possible_names)[0])
    print(df)
    

    Output

          name correct_name
    0      mrk         mark
    1  anth3n7      anthony
    2     j4ck         jack
    

    As an alternative instead of .apply, you could use a list comprehension:

    df["correct_name"] = [difflib.get_close_matches(name, possible_names)[0] for name in df["name"]]
    

    UPDATE

    For the case of no match, one approach is to use:

    possible_names = ['anthony', 'jack']
    df["correct_name"] = [next(iter(difflib.get_close_matches(name, possible_names)), name) for name in df["name"]]
    print(df)
    

    Output

          name correct_name
    0      mrk          mrk
    1  anth3n7      anthony
    2     j4ck         jack