python string python-3.x pandas levenshtein-distance

Python - Assign the closest string from List A to List B based on Levenshtein distance - (ideally with pandas)

By way of introduction: I am fairly new to Python and mainly know how to use pandas for data analysis.

I currently have 2 lists of 100+ entries, "Keywords" and "Groups".

I would like to generate an output (ideally a pandas DataFrame) in which every entry of "Keywords" is assigned the closest entry of "Groups", as measured by Levenshtein distance.

Thank you for your support!

Solution

from editdistance import eval as levenshtein
import pandas as pd

keywords = ["foo", "foe", "bar", "baz"]
groups = ["foo", "bar"]

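# For each keyword, pick the group with the smallest edit distance.
# On ties, min() returns the first such group in the list.
assigned_groups = [min(groups, key=lambda g: levenshtein(g, k))
                   for k in keywords]

df = pd.DataFrame({"Keyword": keywords, "Group": assigned_groups})
#   Keyword Group
# 0     foo   foo
# 1     foe   foo
# 2     bar   bar
# 3     baz   bar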

This uses the editdistance package; install it with pip install editdistance.

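Note that this approach performs O(mn) distance computations, where m is the number of keywords and n is the number of groups. The cost of each computation additionally depends on the lengths of the two strings being compared.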
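
If installing editdistance is not an option, the same assignment can be sketched with the standard library only. This is a minimal illustration, not a replacement for the package: the helper names levenshtein and assign_closest are made up for this example, and the pure-Python distance function will be noticeably slower than the compiled one. It also records the distance, which helps when spotting poor matches.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def assign_closest(keywords, groups):
    """Return (keyword, closest group, distance) rows.

    Hypothetical helper: on ties, the first group in the list wins,
    because min() returns the first minimal element.
    """
    rows = []
    for k in keywords:
        best = min(groups, key=lambda g: levenshtein(g, k))
        rows.append((k, best, levenshtein(best, k)))
    return rows


rows = assign_closest(["foo", "foe", "bar", "baz"], ["foo", "bar"])
```

To get a DataFrame from this, something like pd.DataFrame(rows, columns=["Keyword", "Group", "Distance"]) should work.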