Tags: python, string, nlp, similarity, cosine-similarity

Calculate cosine similarity given 2 sentence strings


From "Python: tf-idf-cosine: to find document similarity", it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, is there any way to calculate cosine similarity between 2 strings?

s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

cosine_sim(s1, s2) # Should give high cosine similarity
cosine_sim(s1, s3) # Shouldn't give high cosine similarity value
cosine_sim(s2, s3) # Shouldn't give high cosine similarity value

Solution

  • A simple pure-Python implementation would be:

    import math
    import re
    from collections import Counter
    
    WORD = re.compile(r"\w+")
    
    
    def get_cosine(vec1, vec2):
        # Dot product over the words that appear in both vectors
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum(vec1[x] * vec2[x] for x in intersection)
    
        # Euclidean norms of the two word-count vectors
        sum1 = sum(count ** 2 for count in vec1.values())
        sum2 = sum(count ** 2 for count in vec2.values())
        denominator = math.sqrt(sum1) * math.sqrt(sum2)
    
        # An empty vector has zero norm; report zero similarity
        if not denominator:
            return 0.0
        return numerator / denominator
    
    
    def text_to_vector(text):
        words = WORD.findall(text)
        return Counter(words)
    
    
    text1 = "This is a foo bar sentence ."
    text2 = "This sentence is similar to a foo bar sentence ."
    
    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)
    
    cosine = get_cosine(vector1, vector2)
    
    print("Cosine:", cosine)
    

    Prints:

    Cosine: 0.861640436855
    

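    The cosine formula used here is the standard definition: the dot product of the two word-count vectors divided by the product of their Euclidean norms.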
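For reference, here is that definition written out, with $A$ and $B$ standing for the word-count vectors of the two texts (the symbols are chosen here for illustration). The second line is the arithmetic for the example above: the first sentence has six words that occur once each; the second has "sentence" twice and seven other words once each.

```latex
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
             = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}

\frac{5 \cdot 1 + 1 \cdot 2}{\sqrt{6}\,\sqrt{11}} = \frac{7}{\sqrt{66}} \approx 0.8616
```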

    This does not weight the words by tf-idf. To use tf-idf, you need a reasonably large corpus from which to estimate the tf-idf weights.
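If a corpus is available, the weighting could be sketched in pure Python roughly as follows. This is only an illustration under stated assumptions: the three-sentence `corpus` is far too small to give meaningful weights, the helper names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`) are hypothetical, and the smoothed idf formula is one common variant among several.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def tokenize(text):
    # Lowercase so that "This" and "this" count as the same word
    return WORD.findall(text.lower())


def build_idf(corpus):
    """Smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each word at most once per document
        doc_freq.update(set(tokenize(doc)))
    # Assumed variant: log((1 + N) / (1 + df)) + 1 keeps every weight positive
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def tfidf_vector(text, idf):
    counts = Counter(tokenize(text))
    # Words never seen in the corpus are dropped (an assumption of this sketch)
    return {w: c * idf[w] for w, c in counts.items() if w in idf}


def cosine(vec1, vec2):
    numerator = sum(vec1[w] * vec2[w] for w in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


# Toy corpus for illustration only; real tf-idf needs many more documents
corpus = [
    "This is a foo bar sentence .",
    "This sentence is similar to a foo bar sentence .",
    "What is this string ? Totally not related to the other two lines .",
]
idf = build_idf(corpus)
vectors = [tfidf_vector(doc, idf) for doc in corpus]

print("s1 vs s2:", cosine(vectors[0], vectors[1]))
print("s1 vs s3:", cosine(vectors[0], vectors[2]))
```

Words that occur in every document, such as "this" and "is", get the lowest weight, so they contribute less to the similarity than rarer words like "foo".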

    You can also develop it further by extracting words from the text in a more sophisticated way, by stemming or lemmatising them, and so on.
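As one possible direction, still without external libraries, the sketch below lowercases the text, keeps apostrophes inside words, and drops a few very common words before counting. The `STOP_WORDS` set is a tiny hypothetical list made up for this example, and `cosine_sim` is simply a wrapper with the name used in the question; real stemming or lemmatisation would need more than this.

```python
import math
import re
from collections import Counter

# Lowercase letters and digits, optionally followed by an apostrophe suffix ("don't")
WORD = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

# Tiny illustrative stop-word list; hypothetical, not a standard one
STOP_WORDS = {"a", "an", "the", "is", "to", "this"}


def text_to_vector(text):
    words = WORD.findall(text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_sim(text1, text2):
    vec1, vec2 = text_to_vector(text1), text_to_vector(text2)
    numerator = sum(vec1[w] * vec2[w] for w in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

print(cosine_sim(s1, s2))  # high: the sentences share "foo", "bar" and "sentence"
print(cosine_sim(s1, s3))  # no shared words remain once the stop words are removed
print(cosine_sim(s2, s3))
```

With the stop words removed, the unrelated sentence no longer scores above zero just because it contains "this" and "is".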