Tags: python, python-3.x, nlp, levenshtein-distance

How to remove similar keywords from a list in python?


I am trying to remove similar keywords from a list of keywords:

keywords=['ps4 pro deals','ps4 pro deals London']

So I just need to keep "ps4 pro deals" and remove the similar one. The code I tried uses Levenshtein distance for the similarity check:

similar_tags = [] 
to_be_removed = []
for word1 in keywords:
    for word2 in keywords:
        if .5 < Levenshtein.token_sort_ratio(word1, word2)< 1 :
            to_be_removed.append(word1)

for word in to_be_removed:
    if word in keywords:
        keywords.remove(word)

This code removes both keywords instead of only the similar one.


Solution

  • Consider the following simple example:

    words = ['A','B']
    for w1 in words:
        for w2 in words:
            print(w1,w2)
    

    Output:

    A A
    A B
    B A
    B B
    

    Note that both A B and B A appear. If A B meets the criterion, then B A does too (for Levenshtein distance the order of the inputs is irrelevant). The first match adds A to the removal list and the second adds B, so both A and B are removed.

    You might use the following construct, in which w2 always comes after w1 in words:

    words = ['A','B','C']
    for inx, w1 in enumerate(words, 1):
        for w2 in words[inx:]:
            print(w1,w2)
    

    Output:

    A B
    A C
    B C
    

    Explanation: for every w1 in words, I take a slice of words containing only the elements after it. I use enumerate (starting at 1) to know how many elements need to be skipped, and then slicing to skip them.
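
Applied to the keyword list from the question, the construct above could look roughly like the sketch below. This is not the asker's exact setup: the similarity score here comes from the standard library's difflib.SequenceMatcher (sorting the tokens first, to mimic a token-sort comparison) instead of the Levenshtein call in the question. The names similarity, remove_similar and threshold are hypothetical, and the 0.5 cut-off is simply taken over from the question.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score in [0, 1]; tokens are sorted so word order is ignored.

    Assumption: difflib stands in for the Levenshtein-based ratio here.
    """
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, a_sorted, b_sorted).ratio()


def remove_similar(keywords, threshold=0.5):
    """Keep the first keyword of every group of similar keywords."""
    removed = set()  # indices of keywords similar to an earlier, kept keyword
    for inx, w1 in enumerate(keywords, 1):
        if inx - 1 in removed:
            continue  # w1 itself was already dropped, do not compare against it
        # w2 always comes after w1, so each pair is checked only once
        for j, w2 in enumerate(keywords[inx:], inx):
            if similarity(w1, w2) > threshold:
                removed.add(j)
    return [w for i, w in enumerate(keywords) if i not in removed]


keywords = ['ps4 pro deals', 'ps4 pro deals London']
print(remove_similar(keywords))  # prints ['ps4 pro deals']
```

Tracking indices rather than the strings themselves means that exact duplicates are handled too: only the later copy is dropped. Because each pair is visited once and only the later element is ever marked, the earlier keyword always survives.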