My intent is to cluster document vectors from doc2vec using HDBSCAN. I want to find tiny clusters of semantic and textual duplicates.
To do this I am using gensim to generate document vectors. The elements of the resulting docvecs are all in the range [-1,1].
To compare two documents I want to compare the angular similarity. I do this by calculating the cosine similarity of the vectors, which works fine.
But to cluster the documents, HDBSCAN requires a distance matrix rather than a similarity matrix. The native conversion from cosine similarity to cosine distance in sklearn is `1 - similarity`. However, it is my understanding that using this formula can break the triangle inequality, preventing it from being a true distance metric. When searching and looking at other people's code for similar tasks, it seems that most people use `sklearn.metrics.pairwise.pairwise_distances(data, metric='cosine')`, which defines cosine distance as `1 - similarity` anyway, and it appears to provide appropriate results.
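For concreteness, here is a minimal sketch of that approach, assuming the doc2vec vectors are already stacked into an `(n_docs, dim)` array (random data stands in for real docvecs here):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 16))  # stand-in for doc2vec document vectors

# 1 - cosine_similarity for every pair: symmetric, zero diagonal,
# values in [0, 2].
dist = pairwise_distances(vectors, metric='cosine')

# This matrix could then be handed to HDBSCAN as a precomputed metric, e.g.
#   hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2).fit(dist.astype(np.float64))
print(dist.shape)  # (5, 5)
```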
I am wondering if this is correct, or if I should instead use angular distance, calculated as `np.arccos(cosine_similarity) / np.pi`.
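For reference, the angular distance in question can be computed like this (the clipping is only there to guard against floating-point values drifting slightly outside [-1, 1]):

```python
import numpy as np

def angular_distance(u, v):
    # Cosine similarity of the two vectors, then mapped through arccos
    # and scaled to [0, 1]; this version does satisfy the triangle inequality.
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(angular_distance(u, v))  # 0.5 for orthogonal vectors
```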
I have also seen people use Euclidean distance on l2-normalized document vectors; my understanding is that this is monotonically related to cosine distance, so it ranks neighbors the same way.
Please let me know what is the most appropriate method for calculating distance between document vectors for clustering :)
I believe in practice cosine-distance is used, despite the fact that there are corner-cases where it's not a proper metric.
You mention that "elements of the resulting docvecs are all in the range [-1,1]". That isn't usually guaranteed to be the case – though it would be if you've already unit-normalized all the raw doc-vectors.
If you have done that unit-normalization, or want to, then after such normalization euclidean-distance will always give the same ranked-order of nearest-neighbors as cosine-distance. The absolute values, and relative proportions between them, will vary a little – but all "X is closer to Y than Z" tests will be identical to those based on cosine-distance. So clustering quality should be nearly identical to using cosine-distance directly.
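A quick numerical sketch of that claim: for unit-normalized vectors, euclidean distance equals `sqrt(2 * cosine_distance)`, so both produce the same nearest-neighbor ordering (random vectors stand in for docvecs):

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(6, 8))
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)  # unit-normalize rows

query = unit[0]
cos_dist = 1.0 - unit[1:] @ query                    # cosine distance to query
euc_dist = np.linalg.norm(unit[1:] - query, axis=1)  # euclidean distance to query

# Same ranked order of nearest neighbors, and the exact monotonic relation.
print(np.array_equal(np.argsort(cos_dist), np.argsort(euc_dist)))  # True
print(np.allclose(euc_dist, np.sqrt(2 * cos_dist)))                # True
```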