algorithm, cluster-analysis, k-means, cosine-similarity, trigonometry

How is cosine similarity used with the K-means algorithm?


I have three text documents represented as vectors of different lengths in the vector space model (VSM), where each entry is the tf-idf weight of a term:

Q1: How does k-means use cosine similarity, and how are the clusters then constructed?

Q2: When I compute TF-IDF, it produces negative values. Is there a problem in my calculation?

Please use the following document vectors in VSM (tf-idf), which all have different lengths, for explanation purposes.

Doc1 (0.134636045, -0.000281926, -0.000281926, -0.000281926, -0.000281926, 0)
Doc2 (-0.002354898, 0.012411358, 0.012411358, 0.09621575, 0.3815553)
Doc3 (-0.001838258, 0.009688438, 0.019376876, 0.05633028, 0.59569238, 0.103366223, 0)

I would be grateful to anyone who can explain this.


Solution

  • Cosine similarity means you take the dot product of the vector and the k-means centre rather than the Euclidean distance between them.

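    The dot product is a.x*b.x + a.y*b.y + ... + a.zz*b.zz, summed over all the dimensions. You generally normalize the vectors to unit length first, so that the dot product equals the cosine of the angle between them. Then call acos() on the result to get the angle itself.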

    Essentially, you're dividing the points into angular sectors around the origin rather than into randomly-clumped clusters.
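The steps in this answer can be sketched in Python roughly as follows. This is only a minimal illustration, not a full k-means implementation: it shows the assignment step (each document goes to the centre with the highest cosine similarity) and leaves out the centre update. The helper names (`dot`, `normalize`, `cosine_similarity`, `angle`, `assign_to_centres`) and the sample vectors are hypothetical, and the sketch assumes every vector has the same number of dimensions, which the dot product requires.

```python
import math


def dot(a, b):
    # Sum of pairwise products over all dimensions.
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale a vector to unit length so the dot product becomes a cosine.
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    # Assumption: both vectors live in the same term space (same length).
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return dot(normalize(a), normalize(b))


def angle(a, b):
    # Clamp to [-1, 1] to guard against floating-point drift before acos().
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(cos)


def assign_to_centres(docs, centres):
    # Assignment step only: pick the centre with the highest cosine similarity
    # (equivalently, the smallest angle) for each document.
    return [
        max(range(len(centres)), key=lambda k: cosine_similarity(d, centres[k]))
        for d in docs
    ]


# Hypothetical 3-dimensional vectors, not the ones from the question.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.2, 1.0]]
centres = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
labels = assign_to_centres(docs, centres)
print(labels)
```

Because only the direction of each vector matters here, two documents that differ only in overall magnitude land in the same sector.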