I'm new to document similarity in python and I'm confused about how to go about working with some data. Basically, I want to get the cosine similarity between dicts containing keywords.
I have dicts like so, which I am getting straight from a database:
{'hat': 0.12, 'cat': 0.33, 'sat': 0.45}
{'rat': 0.22, 'bat': 0.98, 'cat': 0.01}
Each dict maps keywords to their respective tf-idf scores/weights, i.e. the data comes back from the query in this format:
{'keyword': tfidf_score}
All I want to do is get the cosine similarity between these two dicts, weighted by the tfidf score. Looking online, I was pretty overwhelmed by all the different python libraries/modules when it comes to document similarity. I have no idea if there is some built-in function out there that I can just pass these sorts of json objects to, if I should be writing my own function that uses the weights, or what.
Any help is appreciated!
Thank you!
The SciKit Learn library has a fairly simple cosine metric. While I agree the library is large and can seem overwhelming, you can dip into small parts of it.
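For example, since your data already comes as keyword-to-weight dicts, one option is to put both dicts onto the same keyword dimensions with DictVectorizer and then call cosine_similarity on the result. A minimal sketch, using your example values:
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = {'hat': 0.12, 'cat': 0.33, 'sat': 0.45}
doc_b = {'rat': 0.22, 'bat': 0.98, 'cat': 0.01}

# Align both dicts on the same keyword dimensions; missing keywords become 0
vectorizer = DictVectorizer()
X = vectorizer.fit_transform([doc_a, doc_b])

# cosine_similarity returns a 2x2 matrix; [0, 1] is doc_a vs doc_b
print(cosine_similarity(X)[0, 1])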
I'm not exactly sure what you are trying to achieve by comparing things in the way you suggest, but if you are trying to get the cosine similarity between documents represented by keywords in a corpus, you first need (as Marmikshah points out) to have a vector representation of the docs in keyword terms (dimensions).
e.g.
import logging
import numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
logging.basicConfig(level=logging.DEBUG,
                    filename='test.log', filemode='w')

dataset = ['the cat sat on the mat',
           'the rat sat in the hat',
           'the hat sat on the bat']
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(dataset)
# ...you say you are already at this point here...
sims = cosine_similarity(X_tfidf, X_tfidf)
rank = list(reversed(numpy.argsort(sims[0])))
logging.debug("\nTdidf: \n%s" % X_tfidf.toarray())
logging.debug("\nSims: \n%s", sims)
logging.debug("\nRank: \n%s", rank)
Normally, e.g. in a search, you'd vectorise the corpus in advance, then vectorise the search query with the already-fitted vectorizer (so it lands in the same keyword dimensions) and get the sims of its representation:
Y_tfidf = vectorizer.transform(search_query)
sims = cosine_similarity(Y_tfidf, X_tfidf)
Then rank and pick/present the top documents.
In the example above I used X, X instead of Y, X so that the documents within the corpus are cross-referenced against each other.
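Putting the query-time flow together, a self-contained sketch (the search query string here is just a made-up example):
import numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = ['the cat sat on the mat',
           'the rat sat in the hat',
           'the hat sat on the bat']
search_query = ['the cat in the hat']  # hypothetical query

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(dataset)   # fit once on the corpus
Y_tfidf = vectorizer.transform(search_query)  # reuse the corpus vocabulary

sims = cosine_similarity(Y_tfidf, X_tfidf)[0]
rank = list(reversed(numpy.argsort(sims)))    # document indices, best match first

for i in rank:
    print(sims[i], dataset[i])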