I am learning about word embeddings and cosine similarity. My data consists of the same set of words in two different languages.
I did two tests: one computing soft cosine similarity and one computing the usual cosine similarity between the paired words.
Should I expect to obtain roughly the same results from both? I noticed that sometimes the two give opposite results. Since I am new to this, I am trying to figure out whether I did something wrong or whether there is an explanation behind it. From what I have been reading, soft cosine similarity should be more accurate than the usual cosine similarity.
Now for some data. Unfortunately I can't post part of my data (the words themselves), but I will do my best to give as much information as I can.
Some other details first. This is how I compute the similarity between the averaged vectors of each word pair, as 1 minus SciPy's cosine distance:
from scipy.spatial import distance

# SciPy's cosine() returns a distance, so 1 - distance is the cosine similarity
similarity = 1 - distance.cosine(data['LANG1_AVG'].iloc[i], data['LANG2_AVG'].iloc[i])
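In case it helps, here is a minimal self-contained sketch of that row-by-row computation; the DataFrame contents are toy values, assumed purely for illustration:

import numpy as np
import pandas as pd
from scipy.spatial import distance

# Toy data: each cell holds an averaged word embedding (values assumed)
data = pd.DataFrame({
    'LANG1_AVG': [np.array([0.1, 0.3, 0.5]), np.array([0.2, 0.1, 0.9])],
    'LANG2_AVG': [np.array([0.1, 0.4, 0.4]), np.array([0.9, 0.0, 0.1])],
})

for i in range(len(data)):
    # 1 - cosine distance = cosine similarity
    sim = 1 - distance.cosine(data['LANG1_AVG'].iloc[i], data['LANG2_AVG'].iloc[i])
    print(f"pair {i}: cosine similarity = {sim:.3f}")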
For the usual cosine similarity I am using the FastVector cosine similarity from FastText Multilingual, defined as follows:
import numpy as np

@classmethod
def cosine_similarity(cls, vec_a, vec_b):
    """Compute cosine similarity between vec_a and vec_b"""
    return np.dot(vec_a, vec_b) / \
        (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
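Incidentally, this formula computes exactly the same quantity as 1 - distance.cosine(...) above (for nonzero vectors), so on identical inputs the two plain-cosine routes should agree. A quick sanity check with toy vectors, values assumed for illustration:

import numpy as np
from scipy.spatial import distance

vec_a = np.array([0.2, 0.5, 0.1])
vec_b = np.array([0.4, 0.1, 0.3])

# FastVector-style formula vs. 1 - SciPy's cosine distance
manual = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
scipy_sim = 1 - distance.cosine(vec_a, vec_b)

print(manual, scipy_sim)  # equal up to floating-point error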
As you can see from the image here, for some words I obtained the same or very similar results with the two methods, while for others I obtained two totally different results. How can I explain this?
After some additional research, I found a 2014 paper (Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model) that explains when and how it can be useful to use averages of the features, and also what exactly a soft cosine measure is:
Our idea is more general: we propose to modify the manner of calculation of similarity in Vector Space Model taking into account similarity of features. If we apply this idea to the cosine measure, then the “soft cosine measure” is introduced, as opposed to traditional “hard cosine”, which ignores similarity of features. Note that when we consider similarity of each pair of features, it is equivalent to introducing new features in the VSM. Essentially, we have a matrix of similarity between pairs of features and all these features represent new dimensions in the VSM.
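Concretely, for vectors a and b and a feature-similarity matrix S = (s_ij), the paper's soft cosine is a^T S b / (sqrt(a^T S a) * sqrt(b^T S b)); with S equal to the identity matrix it reduces to the ordinary "hard" cosine. A minimal NumPy sketch of that definition, with a toy similarity matrix assumed purely for illustration:

import numpy as np

def soft_cosine_similarity(vec_a, vec_b, sim_matrix):
    """Soft cosine (Sidorov et al., 2014): a'Sb / (sqrt(a'Sa) * sqrt(b'Sb))."""
    numerator = vec_a @ sim_matrix @ vec_b
    denominator = (np.sqrt(vec_a @ sim_matrix @ vec_a)
                   * np.sqrt(vec_b @ sim_matrix @ vec_b))
    return numerator / denominator

# Toy feature-similarity matrix S (values assumed); s_ii = 1 on the diagonal
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

a = np.array([1.0, 0.0, 1.0])
b = np.array([0.0, 1.0, 1.0])

print(soft_cosine_similarity(a, b, S))          # soft cosine, uses feature similarity
print(soft_cosine_similarity(a, b, np.eye(3)))  # with S = I this is the hard cosine

This also makes the divergence plausible: whenever the off-diagonal entries of S are large, the soft and hard measures can score the same pair very differently.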