Tags: python, scikit-learn, cosine-similarity

Cosine similarity output different scipy vs sklearn


I'm sure I'm overlooking something but why are these outputs different?

scikit-learn

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([[3,5,1]],[[1,2,3]])

Output: `array([[0.72280632]])`

scipy

from scipy.spatial.distance import cosine
cosine([3,5,1],[1,2,3])

Output: `0.27719367767579906`

Why aren't these the same? From my calculations, the difference doesn't seem to come from using the L1 versus the L2 norm in the denominator.


Solution

  • The two functions compute different quantities: sklearn returns the cosine similarity, while scipy returns the cosine distance.

    The docstring for sklearn.metrics.pairwise.cosine_similarity says:

    Compute cosine similarity between samples in X and Y. Cosine similarity, or the cosine kernel, computes similarity as the normalized dot product of X and Y:

    $cosine(X, Y) = < X, Y > / (||X|| * ||Y||)$

    While scipy.spatial.distance.cosine says:

    The Cosine distance between X and Y, is defined as

    $cosine(X, Y) = 1 - < X, Y > / (||X|| * ||Y||)$.

    where $< X, Y >$ is the dot product between $X$ and $Y$ and $||X||$ is the L2 norm.

    (I changed the doc strings a little to use the same variable names and mathematical conventions for an easier comparison.)

    In short, `cosine_scipy = 1 - cosine_sklearn`. With the numbers from the question, 1 - 0.72280632 ≈ 0.27719368, which matches the scipy output.
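
To check the relationship by hand, here is a minimal sketch that uses only the standard library and follows the two formulas quoted above. The helper names `cos_sim` and `cos_dist` are made up for illustration; they are not the library functions, and the library implementations may differ in details such as input validation.

```python
import math

def cos_sim(u, v):
    # Hypothetical helper: normalized dot product, <u, v> / (||u|| * ||v||),
    # with ||.|| taken as the L2 norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cos_dist(u, v):
    # Hypothetical helper: cosine distance defined as 1 - cosine similarity.
    return 1.0 - cos_sim(u, v)

u, v = [3, 5, 1], [1, 2, 3]
sim = cos_sim(u, v)    # roughly 0.7228, the value sklearn reported
dist = cos_dist(u, v)  # roughly 0.2772, the value scipy reported
print(sim, dist)
```

If you want a similarity from scipy, subtracting its result from 1 should give the same number as sklearn, up to floating-point rounding.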