If we have two lists of strings:
A = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
B = "bank, weather, sun, moon, fun, hi".split(",")
The words in list A constitute my word vector basis.
How can I calculate the cosine similarity scores of each word in B?
What I've done so far: I can calculate the cosine similarity of two whole lists with the following function:
import math

def counter_cosine_similarity(c1, c2):
    # c1 and c2 are Counter objects (or dicts) mapping terms to counts.
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)
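As a minimal sketch of how this function is called, assuming each list is first converted to a `Counter` of lowercased, punctuation-stripped words (that normalization and the split on `", "` are assumptions for illustration, not part of the function itself):

```python
import math
from collections import Counter

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
    return dotprod / (magA * magB)

A = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
B = "bank, weather, sun, moon, fun, hi".split(", ")

# Assumption: lowercase and strip surrounding punctuation so "fine." counts as "fine".
counts_a = Counter(w.strip(".,?!").lower() for w in A)
counts_b = Counter(w.strip(".,?!").lower() for w in B)

similarity = counter_cosine_similarity(counts_a, counts_b)
print(round(similarity, 4))  # 0.1054
```

The two lists share only the word "weather", so the result is 1 / sqrt(15 * 6), roughly 0.105. This compares the lists as wholes and says nothing about individual words in B.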
But how do I incorporate my vector basis, and how can I then calculate the similarities between the terms in B?
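import math
from collections import Counter

ListA = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
# Split on ", " so the words in ListB carry no leading space.
ListB = "bank, weather, sun, moon, fun, hi".split(", ")


def cosdis(v1, v2):
    # Cosine similarity of two vectors produced by word2vec():
    # dot product over shared characters, divided by both vector lengths.
    common = v1[1].intersection(v2[1])
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]


def word2vec(word):
    # Represent a word as a character-count vector.
    cw = Counter(word)                               # character counts
    sw = set(cw)                                     # distinct characters
    lw = math.sqrt(sum(c * c for c in cw.values()))  # Euclidean length
    return cw, sw, lw


def removePunctuations(str_input):
    # Drop every punctuation character from the input string.
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    return "".join(char for char in str_input if char not in punctuations)


# Compare every word in ListA with every word in ListB.
for i in ListA:
    for j in ListB:
        print(cosdis(word2vec(removePunctuations(i)), word2vec(removePunctuations(j))))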
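The loop above prints bare numbers, which makes it hard to tell which pair each score belongs to. One possible way to label the results is sketched below: for every word in B it records the closest word in A. The helper `clean` and the names `vectors_a` and `best_match` are hypothetical, introduced only for this illustration, and lowercasing the words is an added assumption.

```python
import math
from collections import Counter

def word2vec(word):
    # Character-count vector: (counts, distinct characters, Euclidean length).
    cw = Counter(word)
    return cw, set(cw), math.sqrt(sum(c * c for c in cw.values()))

def cosdis(v1, v2):
    # Cosine similarity between two character-count vectors.
    common = v1[1].intersection(v2[1])
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]

def clean(word):
    # Hypothetical helper: keep letters only and lowercase them (an assumption).
    return "".join(ch for ch in word.lower() if ch.isalpha())

list_a = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
list_b = "bank, weather, sun, moon, fun, hi".split(", ")

# Build each vector for list_a once instead of once per pair.
vectors_a = {a: word2vec(clean(a)) for a in list_a}

best_match = {}
for b in list_b:
    vb = word2vec(clean(b))
    scores = {a: cosdis(va, vb) for a, va in vectors_a.items()}
    best = max(scores, key=scores.get)
    best_match[b] = (best, round(scores[best], 3))

for b, (a, score) in best_match.items():
    print(f"{b!r} is closest to {a!r} (score {score})")
```

Note that these scores measure overlap in characters, not in meaning: "sun" and "fun" share two of their three letters and score 2/3, although the words are unrelated. A word that is empty after cleaning would have length 0 and cause a division by zero, so such inputs would need to be filtered out first.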