I am trying to remove similar keywords from a list of keywords:
keywords=['ps4 pro deals','ps4 pro deals London']
So I just need "ps4 pro deals", with the similar keyword removed. The code I tried, which uses Levenshtein distance for similarity checking:
similar_tags = []
to_be_removed = []
for word1 in keywords:
    for word2 in keywords:
        if .5 < Levenshtein.token_sort_ratio(word1, word2) < 1:
            to_be_removed.append(word1)
for word in to_be_removed:
    if word in keywords:
        keywords.remove(word)
This code removes both keywords instead of the similar one.
Consider the following simple example:
words = ['A','B']
for w1 in words:
    for w2 in words:
        print(w1, w2)
Output:
A A
A B
B A
B B
Note that the output contains both A B and B A. If A B meets the criterion, then B A does as well (for Levenshtein distance the order of the inputs is irrelevant). The first pair causes A to be added to the removal list, the second causes B to be added, and therefore both A and B are removed.
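To see this effect on the keywords from the question, here is a minimal sketch. It assumes `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure (it is not the Levenshtein-based ratio used in the question), and the helper name `similarity` is made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in similarity in [0, 1]; an assumption, not the ratio from the question.
    return SequenceMatcher(None, a, b).ratio()

keywords = ['ps4 pro deals', 'ps4 pro deals London']
to_be_removed = []
for word1 in keywords:
    for word2 in keywords:
        # Each unordered pair is visited twice, once in each order.
        if 0.5 < similarity(word1, word2) < 1:
            to_be_removed.append(word1)

# Both keywords end up in the removal list.
print(to_be_removed)
```

For this pair the stand-in score is the same in both orders, so each keyword is appended once as `word1`.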
You might use the following construct, in which w2 always comes after w1 in words:
words = ['A','B','C']
for inx, w1 in enumerate(words, 1):
    for w2 in words[inx:]:
        print(w1, w2)
Output:
A B
A C
B C
Explanation: for every w1 in words I take a slice of words containing only the elements after it. I use enumerate (starting at 1) to know how many elements need to be skipped, and then slicing to skip them.
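Applied to the keyword list from the question, the same construct could look like the sketch below. It assumes `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure rather than the Levenshtein-based ratio from the question; the function name `remove_similar` and the `threshold` parameter are hypothetical. It also assumes that, for each similar pair, the later keyword is the one to drop.

```python
from difflib import SequenceMatcher

def remove_similar(keywords, threshold=0.5):
    # Hypothetical helper: keep the earlier keyword of each similar pair.
    to_be_removed = set()
    for inx, w1 in enumerate(keywords, 1):
        # w2 only ranges over elements after w1, so each pair is seen once.
        for w2 in keywords[inx:]:
            score = SequenceMatcher(None, w1, w2).ratio()  # stand-in similarity
            if threshold < score < 1:
                to_be_removed.add(w2)
    # Build a new list instead of mutating the input while iterating.
    return [w for w in keywords if w not in to_be_removed]

keywords = ['ps4 pro deals', 'ps4 pro deals London']
print(remove_similar(keywords))
```

Which keyword of a pair to keep is a design choice; adding w1 to the removal set instead of w2 would keep the later one.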