Calculate cosine similarity between words


If we have two lists of strings:

A = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
B = "bank, weather, sun, moon, fun, hi".split(",")

The words in list A constitute my word vector basis. How can I calculate the cosine similarity scores of each word in B?

What I've done so far: I can calculate the cosine similarity of two whole lists with the following function:

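import math

def counter_cosine_similarity(c1, c2):
    # c1 and c2 are dict-like term -> count mappings, e.g. collections.Counter
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    if magA == 0 or magB == 0:
        return 0.0  # an empty mapping has no direction, so avoid dividing by zero
    return dotprod / (magA * magB)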
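
For reference, a minimal sketch of how such a function might be called on the two whole lists, assuming each list is first turned into a collections.Counter of its tokens. The names counter_a, counter_b and whole_list_score are illustrative only, and trimming whitespace around each token is an assumption.

```python
import math
from collections import Counter

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
    if magA == 0 or magB == 0:
        return 0.0
    return dotprod / (magA * magB)

A = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
B = "bank, weather, sun, moon, fun, hi".split(",")

# Assumption: tokens are compared as exact strings after trimming whitespace,
# so "weather" matches but "you?" would not match "you".
counter_a = Counter(w.strip() for w in A)
counter_b = Counter(w.strip() for w in B)

# Only "weather" is shared, so the score is 1 / (sqrt(15) * sqrt(6)).
whole_list_score = counter_cosine_similarity(counter_a, counter_b)
print(whole_list_score)
```

This gives a single score for the pair of lists (about 0.105 here), not one score per word in B.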

But how do I bring in my vector basis, and how can I then calculate the similarities for the individual terms in B?


Solution

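import math
from collections import Counter

ListA = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
# Strip the space left after each comma so it is not counted as a character.
ListB = [w.strip() for w in "bank, weather, sun, moon, fun, hi".split(",")]

def cosdis(v1, v2):
    # Despite the name, this returns the cosine similarity (1.0 = same direction).
    if v1[2] == 0 or v2[2] == 0:
        return 0.0  # a word with no characters left has no direction
    # Only characters present in both words contribute to the dot product.
    common = v1[1].intersection(v2[1])
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]

def word2vec(word):
    # Character-count vector of the word (not a trained word2vec embedding).
    cw = Counter(word)                               # character -> count
    sw = set(cw)                                     # distinct characters
    lw = math.sqrt(sum(c * c for c in cw.values()))  # vector length
    return cw, sw, lw

def removePunctuations(str_input):
    ret = []
    punctuations = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
    for char in str_input:
        if char not in punctuations:
            ret.append(char)

    return "".join(ret)


# The comparison is case-sensitive: "H" and "h" count as different characters.
for i in ListA:
    for j in ListB:
        print(i, j, cosdis(word2vec(removePunctuations(i)), word2vec(removePunctuations(j))))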
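
As a follow-up sketch that is not part of the original answer: the same character-level measure can be used to pick, for each word in B, the closest word in A. The names char_cosine, normalize and best_match are hypothetical, and lowercasing plus stripping ASCII punctuation before comparing is an assumption.

```python
import math
import string
from collections import Counter

def char_cosine(word_a, word_b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(word_a), Counter(word_b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(n * n for n in ca.values()))
            * math.sqrt(sum(n * n for n in cb.values())))
    return dot / norm if norm else 0.0

def normalize(word):
    # Assumption: lowercase, trim whitespace, and drop ASCII punctuation.
    cleaned = word.lower().strip()
    return cleaned.translate(str.maketrans("", "", string.punctuation))

list_a = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
list_b = "bank, weather, sun, moon, fun, hi".split(",")

# For each word in B, keep the (score, word) pair of its closest match in A.
best_match = {}
for b in list_b:
    nb = normalize(b)
    scored = [(char_cosine(normalize(a), nb), a) for a in list_a]
    best_match[b.strip()] = max(scored)

for word, (score, match) in best_match.items():
    print(f"{word!r} is closest to {match!r} (score {score:.3f})")
```

Because the vectors count characters, this measures spelling overlap rather than meaning, so words that share letters score highly even when they are unrelated.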