Tags: python, string, list, algorithm, nlp

Match most similar string pairs from two lists of strings?


I have two lists of strings (of equal sizes):

l1 = [ "Which of the following products have you used", "Provide a rating to the product", "Will you purchase the product again" ]

l2 = [ "Please give a rating to the product", "Will you buy the product again", "Please select a product you have used" ]

I have to write a program that matches each sentence in l1 to the most similar sentence in l2 (the expected pairing is shown in the solution output below).

There are basically two sub-problems here:

  1. How exactly to quantify/score the similarity between two sentences. I have thought of using a simple character-matching algorithm (fuzzywuzzy) to keep the program's running time low, but I am open to suggestions for a better algorithm.

  2. How to apply the chosen algorithm to find the actual matches between the strings. Once an algorithm is selected in the step above, how do we apply it to the two lists? One possible way I thought of is to generate every possible one-to-one assignment of string pairs and select the assignment with the highest total score. However, with n strings per list there are n! such assignments, so this brute-force approach fails once the lists have 10 or more items: a run takes over 10 minutes on my machine (a minimal sketch of this search is shown below).
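
For reference, here is a minimal sketch of the brute-force approach I described, assuming fuzzywuzzy's fuzz.ratio as the scorer. It tries every one-to-one assignment of l1 to l2, i.e. n! candidates, which is why it blows up so quickly:

    from itertools import permutations
    from fuzzywuzzy import fuzz

    def best_assignment(l1, l2):
        # Score every one-to-one assignment (n! of them) and keep the one
        # with the highest total fuzz.ratio score.
        best_score, best_perm = -1, None
        for perm in permutations(l2):
            score = sum(fuzz.ratio(a, b) for a, b in zip(l1, perm))
            if score > best_score:
                best_score, best_perm = score, perm
        return list(best_perm)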


Solution

  • Instead of using fuzzy matching, why not use sentence similarity as a criterion for matching sentences semantically (contextual matching)?

    You can use a GloVe model (similar to word2vec), pre-trained on Wikipedia, in which each word is represented as a 50-dimensional vector. You can choose a model other than the one I used from here - https://github.com/RaRe-Technologies/gensim-data

    Once you have embedded each sentence as a vector (50-dimensional in this case), you can use cosine similarity to determine which sentence embeddings are closest to each other, using scipy.spatial.distance.cosine for example.
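
    For instance, a toy illustration of the relationship (scipy returns the cosine *distance*, so similarity is one minus that):

    import numpy as np
    from scipy.spatial.distance import cosine

    a = np.array([1.0, 0.0, 1.0])
    b = np.array([0.5, 0.2, 0.9])
    print(1 - cosine(a, b))  # near 1.0 for vectors pointing the same way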

    Unlike a fuzzy match, which is essentially edit distance (Levenshtein distance) applied at the character level, word2vec (and related models such as fastText and GloVe) represents each word as a point in an n-dimensional Euclidean space. The vector that represents each word is called a word vector or word embedding.
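
    To make that difference concrete (a small illustration, not part of the matching code below): character-level matching scores synonyms poorly, while the word embeddings place them close together.

    from fuzzywuzzy import fuzz
    import gensim.downloader as api

    model = api.load("glove-wiki-gigaword-50")
    print(fuzz.ratio("buy", "purchase"))        # low: the strings share few characters
    print(model.similarity("buy", "purchase"))  # much higher: semantically close words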

    A more advanced option would be a pre-trained sentence-embedding model, such as a BERT-based one, but those are meant for more demanding situations and require a deeper understanding of how the models and their parameters work, so I wouldn't recommend them until you are comfortable with how this approach works.
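
    For reference only, such models usually come wrapped in a library. This sketch assumes the sentence-transformers package and the "all-MiniLM-L6-v2" model; neither is prescribed by this answer:

    # Hypothetical sketch, not the method used below: a pre-trained
    # BERT-style model embeds whole sentences directly.
    from sentence_transformers import SentenceTransformer

    st_model = SentenceTransformer("all-MiniLM-L6-v2")
    sentence_embeddings = st_model.encode(["Will you buy the product again",
                                           "Will you purchase the product again"])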

    Here is an example -

    from scipy import spatial
    import numpy as np
    import gensim.downloader as api
    
    model = api.load("glove-wiki-gigaword-50")
    # choose from multiple models: https://github.com/RaRe-Technologies/gensim-data
    
    def get_vector(s):
        # Lowercase, split on whitespace, and sum the 50-dim word vectors
        # to produce a single embedding for the whole sentence.
        s1 = [i.lower() for i in s.split()]
        v1 = np.sum(np.array([model[i] for i in s1]), axis=0)
        return v1
    
    l1 =["Which of the following products have you used",
         "Provide a rating to the product", 
         "Will you purchase the product again"]
    
    l2 =["Please give a rating to the product",
         "Will you buy the product again",
         "Please select a product you have used"]
    
    embeddings1 = np.array([get_vector(i) for i in l1])
    embeddings2 = np.array([get_vector(i) for i in l2])
    
    # Pairwise cosine similarity between all l1/l2 embeddings; argmax picks,
    # for each l1 sentence, the index of the most similar l2 sentence.
    similarity = (1 - spatial.distance.cdist(embeddings1, embeddings2, 'cosine')).argmax(axis=1)
    sorted_l2 = [l2[i] for i in similarity]
    
    print(l1)
    print(sorted_l2)
    
    # original l1 list
    ['Which of the following products have you used', 
    'Provide a rating to the product', 
    'Will you purchase the product again']
    
    # l2 reordered to match l1
    ['Please select a product you have used', 
    'Please give a rating to the product', 
    'Will you buy the product again']
    

    As you can see, this is exactly the result you were expecting.
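
    One caveat: get_vector raises a KeyError for any token missing from the GloVe vocabulary (punctuation stuck to words, typos, rare words). A slightly more defensive variant, a hypothetical helper in the same spirit, simply skips unknown tokens:

    def get_vector_safe(s):
        # Sum the vectors of in-vocabulary tokens only; fall back to a
        # zero vector if nothing is recognised.
        tokens = [t.lower() for t in s.split()]
        vectors = [model[t] for t in tokens if t in model]
        return np.sum(np.array(vectors), axis=0) if vectors else np.zeros(50)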
