python csv cluster-analysis tf-idf cosine-similarity

K-means cluster given a CSV with (tf-idf cosine similarity, doc_id1, doc_id2)?


I have a CSV with the following dataset:

similarity  | doc_id1   | doc_id2
1           |    34     |     0
1           |    29     |     6
0.997801748 |    22     |    10
0.966014701 |    35     |    16
0.964811948 |    14     |    13

Here "similarity" is a value from tf-idf cosine similarity computations, and the doc_ids identify the two documents being compared. The closer similarity is to 1, the more similar the two documents are.

I want to cluster the documents based on this information, but I'm not entirely sure how to do so. I've been reading a lot about spherical K-means clustering, but in terms of implementing it I'm having a hard time wrapping my head around it. Is there a library that might be useful? Is K-means the right way to go at all?

EDIT: This CSV is all I have, so even though I wish I had word frequency based vectors, I don't. If K-means won't work given that all I have are similarities, are there other algorithms that would suit this data?


Solution

  • I believe your problem is that you only have pairwise similarities between documents, whereas K-Means works with Euclidean distances from centroids. This means you need a vector for each document, and in your case those vectors will be quite long. Instead of the precomputed similarities, use one dimension per word, with each word's score in a document as that document's coordinate along that dimension. With these vectors you can use sklearn.cluster.KMeans, as suggested by Sam B.
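
To make the vector-based approach concrete, here is a minimal sketch. It assumes you can get hold of the original document texts, which the question says are not available, so the `docs` list below is a made-up toy corpus. The choice of `n_clusters=2` is likewise arbitrary. `TfidfVectorizer` L2-normalizes each row by default, so Euclidean K-Means on these vectors behaves similarly to clustering by cosine similarity, though it is not a true spherical K-means.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the original documents
# (the question only has pairwise similarities, not the texts).
docs = [
    "the cat sat on the mat",
    "a cat and a kitten sat together",
    "stock prices fell sharply today",
    "the stock market closed lower today",
]

# One dimension per word; each row is one document's tf-idf vector.
# Rows are L2-normalized by default (norm="l2").
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# n_clusters=2 is an arbitrary choice for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Print the cluster assigned to each document.
for doc, label in zip(docs, labels):
    print(label, doc)
```

With real data you would replace `docs` with the document texts in doc_id order, so that `labels[i]` is the cluster of document `i`, and pick the number of clusters by inspecting the results.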