
How to do string semantic matching using gensim in Python?


How can we determine whether a string has a semantic relation to our phrase in Python?

Example:

our phrase is:

'Fruit and Vegetables'

and the strings we want to check for a semantic relation are:

'I have an apple in my basket', 'I have a car in my house'

result:

As we know, the first item, I have an apple in my basket, is related to our phrase.


Solution

  • You can use the gensim library to implement MatchSemantic and write it as a function like the code below (see the full code here):

    Initialization


    1. Install numpy and gensim:
    pip install numpy
    pip install gensim
    

    Code


    1. First of all, import the requirements:
    from re import sub
    import numpy as np
    from gensim.utils import simple_preprocess
    import gensim.downloader as api
    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel
    from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
    
    2. Use this function to check whether the strings and sentences match the phrase you want:
    def MatchSemantic(query_string, documents):
        stopwords = ['the', 'and', 'are', 'a']
    
        # Pad the corpus so it always contains at least two documents
        if len(documents) == 1: documents.append('')
    
        def preprocess(doc):
            # Tokenize, clean up input document string
            doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
            doc = sub(r'<[^<>]+(>|$)', " ", doc)
            doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
            doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
            return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]
    
        # Preprocess the documents, including the query string
        corpus = [preprocess(document) for document in documents]
        query = preprocess(query_string)
    
        # Load the model: this is a big file, can take a while to download and open
        glove = api.load("glove-wiki-gigaword-50")
        similarity_index = WordEmbeddingSimilarityIndex(glove)
    
        # Build the term dictionary, TF-idf model
        dictionary = Dictionary(corpus + [query])
        tfidf = TfidfModel(dictionary=dictionary)
    
        # Create the term similarity matrix.
        similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)
    
        query_tf = tfidf[dictionary.doc2bow(query)]
    
        index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(document) for document in corpus]],
            similarity_matrix)
    
        return index[query_tf]
    
    
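For intuition, the score returned by `index[query_tf]` is the soft cosine measure: an ordinary cosine similarity whose dot products are weighted by a term-similarity matrix S, so related-but-different words (e.g. fruit vs. apple) still contribute. Below is a minimal standalone sketch of that formula with a hypothetical three-term vocabulary and a hand-picked S, not the GloVe-derived matrix used above:

```python
import math

def soft_cosine(a, b, S):
    """Soft cosine: a.S.b / sqrt(a.S.a * b.S.b), with S a term-similarity matrix."""
    dot = lambda x, y: sum(x[i] * S[i][j] * y[j]
                           for i in range(len(x)) for j in range(len(y)))
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

# Toy vocabulary: [fruit, apple, car]; S says fruit and apple are related.
S = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
query = [1, 0, 0]   # "fruit"
doc   = [0, 1, 0]   # "apple"
print(round(soft_cosine(query, doc, S), 2))   # 0.8
```

A plain cosine similarity of these two vectors would be 0 because they share no terms; the similarity matrix is what lets "fruit" and "apple" count as near-matches.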

    Attention: the first time you run the code, a progress bar will go from 0% to 100% while gensim downloads glove-wiki-gigaword-50; after that, the model is cached locally and you can simply run the code.
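The cleanup substitutions in `preprocess` can also be exercised on their own with just the standard-library `re` module; the sample string here is made up purely for illustration:

```python
from re import sub

doc = 'See <img src="a.png"> and https://example.com <b>now</b>'
# Replace image tags with a placeholder token
doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
# Strip any remaining HTML tags
doc = sub(r'<[^<>]+(>|$)', " ", doc)
# Replace URLs with a placeholder token
doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
print(doc.split())   # ['See', 'image_token', 'and', 'url_token', 'now']
```

This is why markup and links don't pollute the vocabulary: they collapse into the placeholder tokens before tokenization.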

    Usage


    For example, we want to see whether Fruit and Vegetables matches any of the sentences or items inside documents.

    Test:

    query_string = 'Fruit and Vegetables'
    documents = ['I have an apple in my basket', 'I have a car in my house']
    print(MatchSemantic(query_string, documents))
    

    So we know that the first item, I have an apple in my basket, has a semantic relation to Fruit and Vegetables, so its score will be about 0.189; for the second item no relation is found, so its score will be 0.

    output:

    0.189    # I have an apple in my basket
    0.000    # I have a car in my house
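Since MatchSemantic returns one soft-cosine score per document, a natural follow-up is to keep only the documents above a chosen cutoff. The threshold value 0.1 below is an arbitrary illustration, not something prescribed by gensim; tune it for your data:

```python
def related_documents(scores, documents, threshold=0.1):
    """Pair each document with its score and keep those above the cutoff."""
    return [doc for doc, score in zip(documents, scores) if score > threshold]

scores = [0.189, 0.000]  # e.g. the output of MatchSemantic above
documents = ['I have an apple in my basket', 'I have a car in my house']
print(related_documents(scores, documents))
# ['I have an apple in my basket']
```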