I created a script to calculate the Levenshtein distance of two strings. Now I want to group a list of strings in base of the Levenshtein distance. (if strings have a distance under a threshold, they will be in the same groub):
At the moment, I have done something but doesn't seems to work. Here is a pseudo-code:
for every string in list:
create a new cluster with this string
remove the string from the list
for every string in the remaining list:
if distance(string1,string2) < threshold:
add string2 to the cluster
remove string2 from the list
Here is the real code since a few users asked it:
cid = 0
clusters = {}
numb = range(len(mylist))
for i in numb:
cls = [mylist[i]]
numb.remove(i)
for j in numb:
if distance(mylist[i],mylist[j]) <= threshold:
cls.append(mylist[j])
numb.remove(j)
clusters[cid] = cls
cid+=1
cls = []
You are removing items from the list you are iterating over. That will change the indexes and cause unexpected behavior you should copy the values you need to a new list rather than modifying the list you are iterating on.