Cosine Similarity between two words in a context in Python


I am trying to compute, in Python, the cosine similarity between two words that appear in a dataset of texts (each text is a tweet). I want to evaluate the similarity based on the context in which the words occur.

I have written the following code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# corpus is a list of texts (in this case, a list of tweets)
corpus = dataset
vectorizer = TfidfVectorizer()
trsfm = vectorizer.fit_transform(corpus)
# pairwise cosine similarity between texts (rows of the TF-IDF matrix)
sims = cosine_similarity(trsfm, trsfm)
count_vect = CountVectorizer()
counts = count_vect.fit_transform(corpus)
# get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
pd.DataFrame(trsfm.toarray(), columns=vectorizer.get_feature_names_out(), index=corpus)
vectorizer.get_feature_names_out()

The result is the similarity between the texts, but I want the similarity between two words.

So, how can I obtain the similarity between two words rather than between two texts? For instance, I want the similarity for each of these pairs of words: {["covid","vaccine"], ["work","covid"], ["environment","pollution"]}.

In addition, I want to represent these words in a Cartesian plane in order to display the distances among them graphically, so I need to calculate their Cartesian coordinates.

Is there anyone who can help me?


Solution

  • Here are some useful links to get started with word embeddings:

    https://www.tensorflow.org/text/guide/word_embeddings  
    https://arxiv.org/abs/1810.04805  
    https://machinelearningmastery.com/what-are-word-embeddings/  
    https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
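
As a simpler starting point than trained embeddings, one option is to describe each word by the tweets it appears in and compare those vectors. The sketch below is only an illustration under stated assumptions: the toy `corpus` and the helper names `word_vectors` and `cosine` are hypothetical, tokenization is a plain whitespace split, and raw counts are used instead of TF-IDF weights. It is not the method described in the links above.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the list of tweets.
corpus = [
    "covid vaccine rollout starts this week",
    "new covid vaccine approved",
    "work from home during covid",
    "pollution harms the environment",
    "environment groups protest pollution",
]


def word_vectors(texts):
    """Map each word to a vector holding its count in every text."""
    counts = [Counter(text.lower().split()) for text in texts]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for words that do not occur in a text.
    return {word: [c[word] for c in counts] for word in vocab}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = word_vectors(corpus)
pairs = [("covid", "vaccine"), ("work", "covid"), ("environment", "pollution")]
similarities = {pair: cosine(vectors[pair[0]], vectors[pair[1]]) for pair in pairs}

for pair, sim in similarities.items():
    print(pair, round(sim, 3))
```

With the code from the question, the same idea should amount to transposing the TF-IDF matrix so that rows are words: `cosine_similarity(trsfm.T)` gives a word-by-word matrix, and `vectorizer.vocabulary_` maps each word to its row index. For plotting, the word vectors could then be reduced to two dimensions, for example with scikit-learn's `TruncatedSVD(n_components=2)`, and the two resulting columns used as x and y coordinates.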