I have a dictionary like this:
mydict = {'in': [0.01, -0.07, 0.09, -0.02], 'and': [0.2, 0.3, 0.5, 0.6], 'to': [0.87, 0.98, 0.54, 0.4]}
I want to calculate the cosine similarity between every pair of words, using a cosine similarity function that takes two vectors. It should first take the values for 'in' and 'and', then the values for 'in' and 'to', and so on.
I want to store the results in another dictionary, where each word (for example 'in') is a key and its value is a dictionary mapping every other word to the cosine similarity computed with that key. The output should look like this:
{'in': {'and': 0.4321, 'to': 0.218}, 'and': {'in': 0.1245, 'to': 0.9876}, 'to': {'in': 0.8764, 'and': 0.123}}
Below is the code that does all of this:
import math

def cosine_similarity(vec1,vec2):
    sum11, sum12, sum22 = 0, 0, 0
    for i in range(len(vec1)):
        x = vec1[i]; y = vec2[i]
        sum11 += x*x
        sum22 += y*y
        sum12 += x*y
    return sum12/math.sqrt(sum11*sum22)

def resultInDict(result,name,value,keyC):
    new_dict={}
    new_dict[keyC]=value
    if name in result:
        result[name] = new_dict
    else:
        result[name] = new_dict

def extract():
    result={}
    res={}
    with open('file.txt') as text:
        for line in text:
            record = line.split()
            key = record[0]
            values = [float(value) for value in record[1:]]
            res[key] = values
    for key,value in res.iteritems():
        temp = 0
        for keyC,valueC in res.iteritems():
            if keyC == key:
                continue
            temp = cosine_similarity(value,valueC)
            resultInDict(result,key,temp,keyC)
    print result
But it gives a result like this:
{'and': {'in': 0.12241083209661485}, 'to': {'in': -0.0654517869126785}, 'from': {'in': -0.5324142931780856}, 'in': {'from': -0.5324142931780856}}
I want it to be like this:
{'in': {'and': 0.4321, 'to': 0.218}, 'and': {'in': 0.1245, 'to': 0.9876}, 'to': {'in': 0.8764, 'and': 0.123}}
I think this is because resultInDict creates a new dictionary, new_dict, to hold the key-value pairs of the inner dictionary. Each time the function is called, the line new_dict={} starts from an empty dictionary, so only one key-value pair is added and it replaces whatever was stored for that key before.
How can I fix this?
Not very elegant, but it does the job:
import math

def cosine_similarity(vec1,vec2):
    sum11, sum12, sum22 = 0, 0, 0
    for i in range(len(vec1)):
        x = vec1[i]; y = vec2[i]
        sum11 += x*x
        sum22 += y*y
        sum12 += x*y
    return sum12/math.sqrt(sum11*sum22)

mydict = {"in" : [0.01, -0.07, 0.09, -0.02], "and" : [0.2, 0.3, 0.5, 0.6], "to" : [0.87, 0.98, 0.54, 0.4]}
mydict_keys = mydict.keys()
result = {}
for k1 in mydict_keys:
    temp_dict = {}
    for k2 in mydict_keys:
        if k1 != k2:
            temp_dict[k2] = cosine_similarity(mydict[k1], mydict[k2])
    result[k1] = temp_dict
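If you would rather keep a helper like resultInDict from the question, one possible fix is to reuse the inner dictionary instead of replacing it on every call. The sketch below assumes Python 3 and uses dict.setdefault for that; the names result_in_dict and vectors are only illustrative, and the vectors are the sample values from the question rather than data read from file.txt.

```python
import math

def cosine_similarity(vec1, vec2):
    # Dot product divided by the square root of the product of squared norms.
    dot = sum(x * y for x, y in zip(vec1, vec2))
    norm1 = sum(x * x for x in vec1)
    norm2 = sum(y * y for y in vec2)
    return dot / math.sqrt(norm1 * norm2)

def result_in_dict(result, name, value, key_c):
    # setdefault returns the inner dict already stored under `name`, or stores
    # and returns a new empty one, so earlier entries are not overwritten.
    result.setdefault(name, {})[key_c] = value

# Sample vectors from the question (illustrative stand-in for the file contents).
vectors = {
    "in": [0.01, -0.07, 0.09, -0.02],
    "and": [0.2, 0.3, 0.5, 0.6],
    "to": [0.87, 0.98, 0.54, 0.4],
}

result = {}
for key, value in vectors.items():
    for key_c, value_c in vectors.items():
        if key_c != key:
            result_in_dict(result, key, cosine_similarity(value, value_c), key_c)

print(result)
```

Each outer key now ends up with one entry per other word, because the inner dictionary is created once and then only extended.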
Also, if you have large data structures, consider using scipy (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html) or scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) to calculate the cosine similarity more efficiently. The latter is not only fast but also memory-friendly, because you can feed it a scipy.sparse matrix.
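As a rough sketch of the scipy route, assuming numpy and scipy are installed: the functions in scipy.spatial.distance return the cosine distance, which is 1 minus the similarity, so the result has to be converted back. Here pdist computes every unordered pair once and squareform expands that into a full matrix; the variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

mydict = {
    "in": [0.01, -0.07, 0.09, -0.02],
    "and": [0.2, 0.3, 0.5, 0.6],
    "to": [0.87, 0.98, 0.54, 0.4],
}

words = list(mydict)
matrix = np.array([mydict[w] for w in words])  # one row per word

# pdist returns cosine *distances* (1 - similarity) for each unordered pair;
# squareform turns them into a symmetric matrix with zeros on the diagonal.
similarity = 1.0 - squareform(pdist(matrix, metric="cosine"))

# Rebuild the nested dictionary, skipping each word's similarity with itself.
result = {
    w1: {w2: float(similarity[i, j]) for j, w2 in enumerate(words) if i != j}
    for i, w1 in enumerate(words)
}

print(result)
```

For a handful of short vectors the plain loop is just as good; this only pays off as the number of words grows.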