Tags: python, list, string-comparison

Remove close matches / similar phrases from list


I am working on removing similar phrases in a list, but I have hit a small roadblock.

I have sentences and phrases; each phrase is related to a sentence, and all the phrases for one sentence are stored in a single list.

Let the phrase list be: p = [['This is great', 'is great', 'place for drinks', 'for drinks'], ['Tonight is a good', 'good night', 'is a good', 'for movies']]

I want my output to be: [['This is great', 'place for drinks'], ['Tonight is a good', 'for movies']]

Basically, I want to keep only the longest unique phrases in each list.

I took a look at the fuzzywuzzy library, but I have not been able to arrive at a good solution.

Here is my code:

def remove_dup(arr, threshold=80):
    ret_arr =[]
    for item in arr:
        if item[1]<threshold:
            ret_arr.append(item[0])
    return ret_arr

def find_important(sents, phrase):

    from fuzzywuzzy import process

    all_processed = [] #final array to be returned
    for i in range(len(sents)):

        new_arr = [] #reshaped phrases for a single sentence
        for item in phrase[i]:
            new_arr.append(item)

        new_arr.sort(reverse=True, key=lambda x : len(x)) #sort with highest length

        important = [] #array to store terms
        important = process.extractBests(new_arr[0], new_arr) #to get levenshtein distance matches
        to_proc = remove_dup(important) #remove_dup removes all relatively matching terms.
        to_proc.append(important[0][0]) #the term with highest match is obviously the important term.


        all_processed.append(to_proc) #add non duplicates to all_processed[]

    return all_processed

Can someone point out what I am missing, or suggest a better way to do this? Thanks in advance!


Solution

  • I would use the set difference between the words of each phrase and the words of every other phrase. If, compared with every other phrase, a phrase contains at least one word that the other phrase lacks, it is unique and should be kept.

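    I've also made it robust to extra spaces: phrases are split into words before they are compared, so a phrase is dropped whenever all of its words appear in another phrase, even if they are not contiguous there. Note that two exactly identical phrases would both be dropped, since neither has a word the other lacks.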

    sentences = [['This is great', 'is great', 'place for drinks', 'for drinks'],
                 ['Tonight is a good', 'good night', 'is a good', 'for movies'],
                 ['Axe far his favorite brand for deodorant body spray',
                  ' Axe far his favorite brand for deodorant spray', 'Axe is']]

    new_sentences = []
    for phrases in sentences:
        # Split each phrase into words; split() also discards extra whitespace.
        phrases = [phrase.split() for phrase in phrases]
        new_phrases = []
        for i, phrase in enumerate(phrases):
            # Keep the phrase if, against every other phrase, it has a word that phrase lacks.
            if all(i == j or set(phrase).difference(other)
                   for j, other in enumerate(phrases)):
                new_phrases.append(" ".join(phrase))
        new_sentences.append(new_phrases)
    print(new_sentences)
    

    Output:

    [['This is great', 'place for drinks'],
     ['Tonight is a good', 'good night', 'for movies'],
     ['Axe far his favorite brand for deodorant body spray', 'Axe is']]
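
The word-difference check above keeps 'good night' because the word 'night' appears in no other phrase, whereas the question asks for it to be removed. One possible way to get closer to the requested output is to drop a phrase when a large share of its words already appears in a longer phrase that has been kept. The following is only a sketch under that assumption: the function name `longest_unique_phrases` and the `max_overlap` parameter are hypothetical, the comparison is made case-insensitive as an extra assumption, and the 0.5 cutoff is simply a value that happens to fit the two example lists from the question.

```python
def longest_unique_phrases(phrases, max_overlap=0.5):
    """Drop phrases whose words are mostly covered by a longer kept phrase.

    max_overlap is a hypothetical tuning knob: a phrase is dropped when at
    least this fraction of its lower-cased words occurs in a kept phrase.
    """
    # Compare phrases as sets of lower-cased words; split() ignores extra spaces.
    word_sets = [{w.lower() for w in phrase.split()} for phrase in phrases]

    # Visit phrases with more words first, so shorter near-duplicates get dropped.
    order = sorted(range(len(phrases)), key=lambda i: len(word_sets[i]), reverse=True)

    kept = []  # indices of the phrases that survive
    for i in order:
        words = word_sets[i]
        if not words:
            continue  # skip empty or whitespace-only phrases
        if any(len(words & word_sets[k]) / len(words) >= max_overlap for k in kept):
            continue  # mostly covered by a phrase that is already kept
        kept.append(i)

    # Return the survivors in their original order, without stray whitespace.
    return [phrases[i].strip() for i in sorted(kept)]


p = [['This is great', 'is great', 'place for drinks', 'for drinks'],
     ['Tonight is a good', 'good night', 'is a good', 'for movies']]

result = [longest_unique_phrases(phrases) for phrases in p]
print(result)
# [['This is great', 'place for drinks'], ['Tonight is a good', 'for movies']]
```

Because each phrase is only compared against phrases that were already kept, one copy of an exactly duplicated phrase survives. The cutoff is aggressive for short phrases: with `max_overlap=0.5`, a two-word phrase that shares a single word with a kept phrase is dropped, so a phrase like 'Axe is' would be removed next to a longer phrase containing 'Axe'. Raising the threshold makes the filter more lenient.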