I have three text documents represented as vectors in a vector space model (VSM), where each entry is the tf-idf weight of a term. The vectors have different lengths.
Q1: How does k-means use cosine similarity, and how are the clusters then constructed?
Q2: My TF-IDF calculation produces negative values. Is there a problem in my calculation?
Please use the following document vectors (tf-idf weights in the VSM), which all have different lengths, for the explanation.
Doc1 (0.134636045, -0.000281926, -0.000281926, -0.000281926, -0.000281926, 0)
Doc2 (-0.002354898, 0.012411358, 0.012411358, 0.09621575, 0.3815553)
Doc3 (-0.001838258, 0.009688438, 0.019376876, 0.05633028, 0.59569238, 0.103366223, 0)
I would be grateful to anyone who can explain this.
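Using cosine similarity means you compare each vector with each k-means centre by their dot product rather than by the Euclidean distance.
The dot product is a.x*b.x + a.y*b.y + ... + a.zz*b.zz, summed over all the dimensions. You generally normalize the vectors to unit length first, so that the dot product equals the cosine of the angle between them. Then call acos() on the result if you want the angle itself.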
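As a rough illustration of the calculation described above, here is a minimal Python sketch. The function names (dot, cosine_similarity, angle_between, pad) are made up for this example. One loud assumption: the dot product only makes sense when both vectors have the same number of dimensions and position i stands for the same term in each, so the sketch pads the three vectors from the question with trailing zeros up to 7 entries. Whether that padding matches the real term order is something only the asker can check.

```python
import math

def dot(a, b):
    """Sum of element-wise products; both vectors must have the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos() out of its domain.
    return max(-1.0, min(1.0, dot(a, b) / (norm_a * norm_b)))

def angle_between(a, b):
    """Angle in radians between the two vectors."""
    return math.acos(cosine_similarity(a, b))

def pad(v, n):
    """ASSUMPTION: missing trailing terms have weight 0 (hypothetical helper)."""
    return list(v) + [0.0] * (n - len(v))

# Vectors copied from the question, padded to a common length of 7.
doc1 = pad((0.134636045, -0.000281926, -0.000281926,
            -0.000281926, -0.000281926, 0), 7)
doc2 = pad((-0.002354898, 0.012411358, 0.012411358,
            0.09621575, 0.3815553), 7)
doc3 = pad((-0.001838258, 0.009688438, 0.019376876,
            0.05633028, 0.59569238, 0.103366223, 0), 7)

for name, (a, b) in {"doc1/doc2": (doc1, doc2),
                     "doc1/doc3": (doc1, doc3),
                     "doc2/doc3": (doc2, doc3)}.items():
    print(name, cosine_similarity(a, b), angle_between(a, b))
```

Under that padding assumption, Doc2 and Doc3 point in nearly the same direction, while Doc1 is close to perpendicular to both.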
Essentially, you're dividing the results into angular sectors around the origin rather than into arbitrarily shaped clumps.
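To make the "sectors" idea concrete, here is a small sketch of k-means driven by cosine similarity (often called spherical k-means). It is only one possible way to do it, not necessarily what any particular library does. The function names, the fixed iteration count, the hand-picked starting centres and the toy 2-D points are all assumptions chosen for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(points, initial_centres, iterations=10):
    """Cluster by direction: returns (labels, unit-length centres)."""
    points = [normalize(p) for p in points]
    centres = [normalize(c) for c in initial_centres]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to the centre with the largest
        # dot product, which is the largest cosine because all are unit length.
        labels = [max(range(len(centres)), key=lambda k: dot(p, centres[k]))
                  for p in points]
        # Update step: each centre becomes the normalized mean of its members.
        for k in range(len(centres)):
            members = [p for p, label in zip(points, labels) if label == k]
            if not members:
                continue  # leave the centre of an empty cluster unchanged
            mean = [sum(column) / len(members) for column in zip(*members)]
            centres[k] = normalize(mean)
    return labels, centres

# Toy data (hypothetical): two points near the x axis, two near the y axis.
# Their magnitudes differ a lot, but only the direction matters here.
points = [(1.0, 0.1), (20.0, 1.0), (0.1, 1.0), (0.2, 30.0)]
labels, centres = spherical_kmeans(points, initial_centres=[(1, 0), (0, 1)])
print(labels)
print(centres)
```

Because every vector is normalized before comparison, a long document and a short one with the same mix of terms end up in the same sector.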