By way of introduction: I am fairly new to Python and mainly know how to use pandas for data analysis.
I currently have two lists of 100+ entries each, "Keywords" and "Groups".
I would like to generate an output (ideally a pandas DataFrame) in which every entry of "Keywords" is assigned the closest entry of "Groups", measured by Levenshtein distance.
Thank you for your support!
from editdistance import eval as levenshtein
import pandas as pd
keywords = ["foo", "foe", "bar", "baz"]
groups = ["foo", "bar"]
# For each keyword, pick the group with the smallest edit distance
assigned_groups = [min(groups, key=lambda g: levenshtein(g, k))
                   for k in keywords]
df = pd.DataFrame({"Keyword": keywords, "Group": assigned_groups})
# Resulting df (recent pandas versions keep the dict's column order):
#   Keyword Group
# 0     foo   foo
# 1     foe   foo
# 2     bar   bar
# 3     baz   bar
This uses the editdistance package. Get it with pip install editdistance.
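Note that this approach needs O(mn) distance evaluations, where m is the number of keywords and n is the number of groups. The cost of each evaluation additionally depends on the lengths of the two strings being compared.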
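If installing editdistance is not an option, the same idea can be sketched with the standard library only. The levenshtein function below is a plain dynamic-programming implementation written for illustration, not the editdistance package's code, and assign_groups is a hypothetical helper name. It also records the distance for each match, which can help spot keywords that have no genuinely close group.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def assign_groups(keywords, groups):
    """Return one row per keyword with its nearest group and the distance."""
    rows = []
    for keyword in keywords:
        # min() keeps the first group on ties, so the order of `groups` matters
        group = min(groups, key=lambda g: levenshtein(g, keyword))
        rows.append({"Keyword": keyword, "Group": group,
                     "Distance": levenshtein(group, keyword)})
    return rows


rows = assign_groups(["foo", "foe", "bar", "baz"], ["foo", "bar"])
for row in rows:
    print(row)
```

Passing the resulting list of dicts to pd.DataFrame(rows) gives a dataframe with the columns Keyword, Group, and Distance. For 100+ entries per list the pure-Python version should be fast enough, though it will likely be slower than the compiled editdistance package.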