Tags: r, matrix, cluster-analysis, hamming-distance, term-document-matrix

R: clustering documents


I've got a DocumentTermMatrix that looks as follows:

       artikel naam product personeel loon verlof
 doc 1       1    1       2         1    0      0
 doc 2       1    1       1         0    0      0
 doc 3       0    0       1         1    2      1
 doc 4       0    0       0         1    1      1

With the tm package it's possible to calculate the Hamming distance between two documents. Now I want to cluster all documents whose Hamming distance to each other is smaller than 3. Here I would like cluster 1 to contain documents 1 and 2, and cluster 2 to contain documents 3 and 4. Is there a way to do that?


Solution

  • I saved your table to myData:

    myData
         artikel naam product personeel loon verlof
    doc1       1    1       2         1    0      0
    doc2       1    1       1         0    0      0
    doc3       0    0       1         1    2      1
    doc4       0    0       0         1    1      1
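
    For anyone following along, the table can be rebuilt by hand as shown here. This is only a sketch: the row labels doc1 to doc4 are my own, and it builds a plain numeric matrix rather than a tm object (if you start from a DocumentTermMatrix, as.matrix() on it should give the same shape).

    ```r
    # Rebuild the example table as a numeric matrix, one row per document.
    myData <- matrix(
      c(1, 1, 2, 1, 0, 0,
        1, 1, 1, 0, 0, 0,
        0, 0, 1, 1, 2, 1,
        0, 0, 0, 1, 1, 1),
      nrow = 4, byrow = TRUE,
      dimnames = list(
        paste0("doc", 1:4),  # illustrative row labels
        c("artikel", "naam", "product", "personeel", "loon", "verlof")
      )
    )
    myData
    ```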
    

    Then I used the hamming.distance() function from the e1071 package. Given a matrix, it computes the distances between its rows. You can use any other distance instead, as long as it is in matrix form.

    library(e1071)
    distMat <- hamming.distance(myData)
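
    To see what that distance matrix contains without installing e1071, the same counts can be reproduced in base R. This is a sketch, not e1071's own code: the helper name hamming is mine, and it assumes the distance is simply the number of columns in which two rows differ.

    ```r
    # Example table as a matrix (row labels are illustrative).
    myData <- rbind(
      doc1 = c(1, 1, 2, 1, 0, 0),
      doc2 = c(1, 1, 1, 0, 0, 0),
      doc3 = c(0, 0, 1, 1, 2, 1),
      doc4 = c(0, 0, 0, 1, 1, 1)
    )

    # Assumed definition: count the positions where two rows differ.
    hamming <- function(a, b) sum(a != b)

    # Fill a symmetric matrix of pairwise distances between rows.
    n <- nrow(myData)
    distMat <- matrix(0, n, n,
                      dimnames = list(rownames(myData), rownames(myData)))
    for (i in seq_len(n)) {
      for (j in seq_len(n)) {
        distMat[i, j] <- hamming(myData[i, ], myData[j, ])
      }
    }
    distMat
    ```

    For this table, doc1/doc2 and doc3/doc4 each differ in 2 columns, while every pair taken across the two groups differs in 5 or 6. Note that 2 versus 1 counts as one mismatch, the same as 1 versus 0.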
    

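    Next comes hierarchical clustering with the "complete" linkage method. With complete linkage, the height at which two clusters merge equals the largest pairwise distance between their members, so the maximum distance allowed within one cluster can be specified later by choosing a cut height.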

    dendrogram <- hclust(as.dist(distMat), method="complete")
    

    Then select the groups by cutting the tree at the maximum distance allowed between points in a group (here the maximum is 5):

    groups <- cutree(dendrogram, h=5)
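
    The question asked for a threshold of "smaller than 3" rather than 5. Since these distances are whole numbers, that corresponds to cutting at h = 2. A minimal sketch, assuming the distance is the number of differing columns (computed in base R here so the snippet runs standalone; hc and maxWithin are just illustrative names):

    ```r
    # Example table as a matrix (row labels are illustrative).
    myData <- rbind(
      doc1 = c(1, 1, 2, 1, 0, 0),
      doc2 = c(1, 1, 1, 0, 0, 0),
      doc3 = c(0, 0, 1, 1, 2, 1),
      doc4 = c(0, 0, 0, 1, 1, 1)
    )

    # Pairwise count of differing columns (base-R stand-in for the distance matrix).
    distMat <- apply(myData, 1, function(a) {
      apply(myData, 1, function(b) sum(a != b))
    })

    # Complete linkage: merge height = largest distance inside the merged cluster.
    hc <- hclust(as.dist(distMat), method = "complete")

    # Integer distances, so "smaller than 3" means "at most 2".
    groups <- cutree(hc, h = 2)
    groups

    # Check: largest pairwise distance inside each resulting cluster.
    maxWithin <- sapply(split(seq_along(groups), groups),
                        function(idx) max(distMat[idx, idx]))
    maxWithin
    ```

    On this data the cut at 2 gives the same two groups as the cut at 5, because the two pairs only merge with each other at height 6.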
    

    Finally plot the results:

    plot(dendrogram)  # main plot
    abline(h=5, col="red", lty=2)  # add cutting line
    rect.hclust(dendrogram, h=5, border=seq_along(unique(groups))+1)  # draw rectangles
    

    [Figure: hclust dendrogram]

    Another way to see the cluster membership for each document is with table:

    table(groups, rownames(myData))
    
    groups doc1 doc2 doc3 doc4
         1    1    1    0    0
         2    0    0    1    1
    

    So the 1st and 2nd documents fall into one group, while the 3rd and 4th fall into another.
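
    If you'd rather have the member names per cluster than a 0/1 table, split() is one option. A small sketch: groups is hard-coded here to match the result above so the snippet runs standalone, and members is just an illustrative name.

    ```r
    # Cluster labels per document, hard-coded to match the table above.
    groups <- c(doc1 = 1L, doc2 = 1L, doc3 = 2L, doc4 = 2L)

    # One list element per cluster, holding the names of its documents.
    members <- split(names(groups), groups)
    members
    ```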