I've built a Doc2Vec model with around 3M documents, and now I want to compare it to another model I've previously built. The second model has been scaled to 0-1, so I now also want to scale the gensim model to the same range so that they are comparable. This is my first time using gensim, so I'm not sure how this is done.

I thought about scaling the inferred vectors (v1 and v2 below) with min-max scaling, using the max/min over the union of the two vectors, but I don't think that would be the correct approach. The idea here is to compare two documents (with tokens likely to be in the corpus) and output a similarity score between them. I've seen a few of Gensim's tutorials, and they often compare a single string to the corpus' documents, which is not really the idea here.

It's nothing fancy, but this is the code I have so far (model generation code omitted):
    def get_similarity_score(self, string_1, string_2):
        split_tokens1 = string_1.split()
        split_tokens2 = string_2.split()
        v1 = self.model.infer_vector(split_tokens1)
        v2 = self.model.infer_vector(split_tokens2)
        text_score = nltk.cluster.util.cosine_distance(v1, v2)
        return text_score
Any recommendations?
Note that 'cosine similarity' & 'cosine distance' are different things.

A cosine-similarity can range from -1.0 to 1.0 – but in some models, such as those based only on positive word counts, you might only practically see values from 0.0 to 1.0. But in both cases, items with similarities close to 1.0 are most-similar.
On the other hand, a cosine-distance can range from 0.0 to 2.0, and items with a distance of 0.0 are least-distant (or nearest). A cosine-distance can be larger than 1.0 - but you might only see such distances in models which use the dense coordinate space (like Doc2Vec), not in word-count models which leave half the coordinate space empty (all negative coordinates).
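To make those ranges concrete, here's a tiny sketch (plain numpy, not from your code) showing that cosine-distance is just 1.0 minus cosine-similarity, so a similarity in [-1.0, 1.0] maps to a distance in [0.0, 2.0]:

    import numpy as np

    def cosine_similarity(a, b):
        # plain cosine similarity: in [-1.0, 1.0] for dense vectors
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    v1 = np.array([1.0, 2.0, -0.5])
    v2 = np.array([-1.0, 0.5, 2.0])

    sim = cosine_similarity(v1, v2)   # a similarity somewhere in [-1.0, 1.0]
    dist = 1.0 - sim                  # the matching distance, in [0.0, 2.0]
    # nltk.cluster.util.cosine_distance(v1, v2) computes this same 1 - similarity value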
So: you shouldn't really be calling your function a similarity score if it's returning a distance, and if it's now returning surprising numbers over 1.0, there's nothing wrong: that's possible in some models, but not others.
You could naively rescale the 0.0 to 2.0 distances that your calculation will get with Doc2Vec vectors, with some crude hammer like:

    new_distance = old_distance / 2
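Applied to the function from your question, that crude squeeze into 0.0 to 1.0 might look like the sketch below – the name get_scaled_distance is just illustrative, and the divide-by-two is the naive mapping of [0.0, 2.0] onto [0.0, 1.0], not any principled calibration:

    def get_scaled_distance(self, string_1, string_2):
        v1 = self.model.infer_vector(string_1.split())
        v2 = self.model.infer_vector(string_2.split())
        # for Doc2Vec's dense vectors this distance falls in [0.0, 2.0]
        old_distance = nltk.cluster.util.cosine_distance(v1, v2)
        # naive rescale into [0.0, 1.0]; 0.0 still means "nearest"
        return old_distance / 2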
However, note that in general, the absolute similarities from different models are still not necessarily meaningfully comparable. This is even true between two different Doc2Vec models: their magnitudes are highly influenced by things like the model metaparameters.
For example, if you used the exact same sufficiently-large set of texts to train a 100-dimensional Doc2Vec model, and a 300-dimensional Doc2Vec model, both models might wind up very similarly-useful. And for a doc A, its nearest-neighbor might consistently be doc B. Indeed, its top-10 neighbors might be very similar or identical.
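(For a concrete sense of the kind of metaparameter difference meant here, a sketch of two such models differing only in vector_size – the toy corpus and min_count=1 are just there to make the snippet self-contained, and the gensim 4.x API is assumed:)

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # toy stand-in corpus; in practice this would be your ~3M real documents
    tagged_corpus = [TaggedDocument(words=text.split(), tags=[i])
                     for i, text in enumerate(["some example text", "another example text"])]

    # same texts, different dimensionality: neighbor rankings may agree closely,
    # but absolute cosine-similarity values need not match between the models
    model_100 = Doc2Vec(tagged_corpus, vector_size=100, epochs=20, min_count=1)
    model_300 = Doc2Vec(tagged_corpus, vector_size=300, epochs=20, min_count=1)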
But the cosine-similarities may have far different maxes/ranges, like the same neighbor B having the similarity 0.9 in one but 0.6 in another. They're the same docs, and correctly identified as 'most-similar', and neither 0.9 nor 0.6 is truly a worse number to report, because in both cases the proper most-similar doc is at the top of the rankings. The models have just wound up using the available coordinate space differently. So, you shouldn't compare that 0.6 or 0.9 similarity (or in your case, other distance numbers) against some other model – especially if the models use different algorithms, as seems to be the case for you. (It looks like you may be comparing absolute cosine-distances from a word-counting model against a dense learned Doc2Vec model.)
It may make more sense to compare result-rankings between the models. That is, ignore the raw similarity numbers, but care whether desirable documents appear in the top-N for other documents. Alternatively, it might be possible to learn some scaling-rule for making the distances more comparable, but it's hard to make a more specific recommendation, or even know if that's a good step to take, without knowing your ultimate goal in comparing the models.
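For instance, one rough rank-based check might look like the sketch below – it assumes, just for illustration, that both models are gensim 4.x Doc2Vec models trained with the same document tags (which may not match your setup), and topn_overlap/doc_tag are hypothetical names:

    def topn_overlap(model_a, model_b, doc_tag, topn=10):
        # most_similar() over the trained doc-vectors returns (tag, similarity) pairs
        neighbors_a = {tag for tag, _ in model_a.dv.most_similar(doc_tag, topn=topn)}
        neighbors_b = {tag for tag, _ in model_b.dv.most_similar(doc_tag, topn=topn)}
        # Jaccard overlap of the two top-N neighbor sets: 1.0 means identical lists
        return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

A high overlap across many probe documents would suggest the models agree on which documents are neighbors, even if their raw similarity or distance numbers sit in very different ranges.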