Tags: python, string, python-3.x, pandas, levenshtein-distance

Python - Assign the closest string from List A to List B based on Levenshtein distance - (ideally with pandas)


By way of introduction: I am fairly new to Python and mainly know how to use pandas for data analysis.

I currently have 2 lists of 100+ entries, "Keywords" and "Groups".

I would like to generate an output (ideally a pandas DataFrame) in which every entry of "Keywords" is assigned its closest entry from "Groups", measured by Levenshtein distance.

Thank you for your support!


Solution

  • from editdistance import eval as levenshtein
    import pandas as pd
    
    keywords = ["foo", "foe", "bar", "baz"]
    groups = ["foo", "bar"]
    
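    # For each keyword, pick the group with the smallest edit distance.
    # On ties, min() returns the first such group in the list.
    assigned_groups = [min(groups, key=lambda g: levenshtein(g, k))
                       for k in keywords]
    
    df = pd.DataFrame({"Keyword": keywords, "Group": assigned_groups})
    #   Keyword Group
    # 0     foo   foo
    # 1     foe   foo
    # 2     bar   bar
    # 3     baz   bar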
    

    This uses the editdistance package. Install it with pip install editdistance.

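    Note that this approach performs O(mn) distance computations, where m is the number of keywords and n is the number of groups. Each computation also takes time that grows with the lengths of the two strings being compared.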
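To make the m-by-n comparison explicit, and to keep the winning distance alongside each match, here is a minimal sketch that uses only the standard library. The `levenshtein` function below is a plain dynamic-programming stand-in for `editdistance.eval`, and `assign_closest` is a hypothetical helper name introduced for illustration. Neither comes from the original answer.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def assign_closest(keywords, groups):
    """Return (keyword, group, distance) rows; ties go to the earliest group."""
    rows = []
    for k in keywords:  # m iterations
        # n distance computations per keyword
        best = min(groups, key=lambda g: levenshtein(g, k))
        rows.append((k, best, levenshtein(best, k)))
    return rows


rows = assign_closest(["foo", "foe", "bar", "baz"], ["foo", "bar"])
```

If pandas is installed, the rows can be turned into a DataFrame with `pd.DataFrame(rows, columns=["Keyword", "Group", "Distance"])`. Keeping the distance makes it easy to filter out weak matches afterwards. For lists of 100+ entries, the compiled editdistance package is likely to be noticeably faster than this pure-Python loop.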