How i calculate the distance between two documents? In the k-means for numbers you have to caculate the distance between two points. I know that i can use the cosinus function. I want to perform clustering to rss documents. I have done stemming and removed the stop words from the documents. I have counted the frequency of word in each document. And now i want to implement the k-mean algorithm.
I'm assuming that your difficulty is in creating the feature vector? Create a feature vector for each document by
For example, if you have
Document 1 = the quick brown fox jumped over the brown dog
Document 2 = the brown cows eat hippo meat
Then the total set of words is [the,quick,brown,fox,jumped,over,the,dog,cows,eat,hippo,meat] and the document vectors are
Document 1 = [1,1,2,1,1,1,1,1,0,0,0,0]
Document 2 = [1,0,1,0,0,0,0,0,1,1,1,1]
And now you just have two giant feature vectors that you can use to represent the document and you can use k-means clustering. As others have said, Euclidean distance can be used to calculate the distance between documents.