Tags: python, python-3.x, nlp, nltk, cosine-similarity

Right way to calculate the cosine similarity of two word-frequency dictionaries in Python?


I'm trying to iterate through a text file and calculate the cosine similarity between the current line and a query entered by the user. I have already tokenized the query and the line and saved the union of their words into a set.

Example:

line_tokenized = ['Karl', 'Donald', 'Ifwerson']

query_tokenized = ['Donald', 'Trump']

word_set = ['Karl', 'Donald', 'Ifwerson', 'Trump']

Now I have to create one dictionary for the line and one for the query, each containing word-frequency pairs. I thought about something like this:

line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1, 'Trump': 0}
query_dict = {'Karl': 0, 'Donald': 1, 'Ifwerson': 0, 'Trump': 1}

But the cosine similarity won't be calculated properly, as the key-value pairs are unordered. I came across OrderedDict(), but I don't understand how to implement some things, as its elements are stored as tuples.

So my questions are:

  • How can I set the key-value pairs and have access to them afterwards?
  • How can I increment the value of a certain key?
  • Or is there an easier way to do this?

Solution

  • You do not need to order the dictionaries to compute cosine similarity; a simple key lookup is sufficient:

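    import math
    
    def cosine_dic(dic1, dic2):
        # Dot product and squared norm of dic1 in a single pass.
        numerator = 0
        dena = 0
        for key1, val1 in dic1.items():
            # Words missing from dic2 count as frequency 0.
            numerator += val1 * dic2.get(key1, 0.0)
            dena += val1 * val1
        # Squared norm of dic2.
        denb = 0
        for val2 in dic2.values():
            denb += val2 * val2
        # Guard against division by zero for empty or all-zero dictionaries.
        if dena == 0 or denb == 0:
            return 0.0
        return numerator / math.sqrt(dena * denb)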
    

    The call .get(key1, 0.0) looks up the word in dic2 and falls back to 0.0 if it is not there. As a result, neither dic1 nor dic2 needs to store entries whose frequency is 0.
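As a quick check of the point above, the sketch below compares the zero-padded dictionaries from the question with sparse versions that drop the zero entries and insert the keys in a different order. The helper name `cosine_sparse` is made up for this example; it only assumes the same lookup idea as `cosine_dic`.

```python
import math

def cosine_sparse(a, b):
    # Hypothetical helper: cosine similarity of two word-frequency dicts.
    # Words missing from b are treated as frequency 0.
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty or all-zero input
    return dot / (norm_a * norm_b)

# Zero-padded dictionaries from the question.
line_dense = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1, 'Trump': 0}
query_dense = {'Karl': 0, 'Donald': 1, 'Ifwerson': 0, 'Trump': 1}

# Same vectors with the zero entries dropped and the keys reordered.
line_sparse = {'Ifwerson': 1, 'Donald': 1, 'Karl': 1}
query_sparse = {'Trump': 1, 'Donald': 1}

dense_score = cosine_sparse(line_dense, query_dense)
sparse_score = cosine_sparse(line_sparse, query_sparse)
print(dense_score, sparse_score)  # both equal 1/sqrt(6), roughly 0.408
```

Both calls give the same score because only the one shared word ('Donald') contributes to the dot product, and zero entries add nothing to either norm.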

    To answer your additional questions:

    How can I set the key-value pairs and have access to them afterwards?

    You simply state:

    dic[key] = value
    

    How can I increment the value of a certain key?

    If you know for sure that the key is already part of the dictionary:

    dic[key] += 1
    

    otherwise you can use:

    dic[key] = dic.get(key, 0) + 1
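Put together, the `.get` pattern is enough to build a word-frequency dictionary from a token list. A minimal sketch, using the tokens from the question with 'Donald' repeated for illustration; the function names are hypothetical. The second variant uses the standard library's `collections.defaultdict(int)`, which creates a missing key with the value 0 on first access.

```python
from collections import defaultdict

def count_words(tokens):
    # Hypothetical helper: count tokens with a plain dict and .get().
    freq = {}
    for token in tokens:
        freq[token] = freq.get(token, 0) + 1
    return freq

def count_words_defaultdict(tokens):
    # Same idea; defaultdict(int) supplies 0 for keys not seen yet,
    # so += 1 is safe even on the first occurrence.
    freq = defaultdict(int)
    for token in tokens:
        freq[token] += 1
    return dict(freq)

tokens = ['Karl', 'Donald', 'Ifwerson', 'Donald']
plain_counts = count_words(tokens)
default_counts = count_words_defaultdict(tokens)
print(plain_counts)
```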
    

    Or is there an easier way to do this?

    You can use collections.Counter, a dict subclass built for counting. It counts the items of an iterable in one call and returns 0 for keys it has not seen.
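A sketch of the Counter approach, assuming the tokenized lists from the question. Because a `Counter` returns 0 for missing keys, no `.get` call is needed; the function name `cosine_counter` is made up for this example.

```python
import math
from collections import Counter

def cosine_counter(tokens_a, tokens_b):
    # Hypothetical helper: build the frequency vectors directly from tokens.
    a, b = Counter(tokens_a), Counter(tokens_b)
    # b[word] is 0 for words that do not occur in tokens_b.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)

line_tokenized = ['Karl', 'Donald', 'Ifwerson']
query_tokenized = ['Donald', 'Trump']
score = cosine_counter(line_tokenized, query_tokenized)
print(round(score, 3))  # 0.408
```

Incrementing also gets simpler: with `counts = Counter()`, the statement `counts[word] += 1` works even if `word` has not been counted yet.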