r, matrix, similarity, dendrogram

R: clustering with a similarity or dissimilarity matrix? And visualizing the results


I have a similarity matrix that I created using Harry, a tool for string similarity, and I want to plot dendrograms from it to see whether I can find clusters or groups in the data. I'm using the following similarity measures:

  • Normalized compression distance (NCD)
  • Damerau-Levenshtein distance
  • Jaro-Winkler distance
  • Levenshtein distance
  • Optimal string alignment distance (OSA)

("For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output")

At first, since it was more or less my first time using R, I didn't pay much attention to the documentation of hclust, so I used it with a similarity matrix. I know I should have used a dissimilarity matrix, and I know that, since my similarity matrix is normalized to [0,1], I could just compute dissimilarity = 1 - similarity and then use hclust.
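The conversion described above can be sketched as follows. This is a minimal illustration on a made-up 4x4 similarity matrix (the values and the labels a to d are assumptions, not Harry output); it runs hclust on both the dissimilarity and, for comparison, the raw similarity values, and cuts each tree into two groups.

```r
# Toy symmetric similarity matrix in [0, 1]; values are invented for illustration.
labels <- c("a", "b", "c", "d")
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(labels, labels))

# Convert similarity to dissimilarity (valid because sim is normalized to [0, 1]).
dis <- 1 - sim

# hclust expects a "dist" object holding dissimilarities.
hc_dis <- hclust(as.dist(dis), method = "average")

# For comparison: passing the similarity values makes hclust read
# "very similar" as "far apart".
hc_sim <- hclust(as.dist(sim), method = "average")

# Cut both trees into two groups.
groups_dis <- cutree(hc_dis, k = 2)
groups_sim <- cutree(hc_sim, k = 2)
```

On this toy input, groups_dis puts the highly similar pairs (a, b) and (c, d) together, while groups_sim pairs a with d, the least similar items. Calling plot(hc_dis) and plot(hc_sim) would draw the two dendrograms.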

But the groups that I get using hclust with the similarity matrix are much better than the ones I get using hclust with its corresponding dissimilarity matrix.

I tried the proxy package as well and ran into the same problem: the groups that I get aren't what I expected.

To get the dendrogram using the similarity matrix I do:

  1. plot(hclust(as.dist(similarityMATRIX), "average"))

With the dissimilarity matrix I tried:

  2. plot(hclust(as.dist(dissimilarityMATRIX), "average"))

and

  3. plot(hclust(as.sim(dissimilarityMATRIX), "average"))

From (1) I get what I believe to be a very good dendrogram, and so I can get very good groups out of it. From (2) and (3) I get the same dendrogram as each other, and the groups that I can get out of it aren't as good as the ones I get from (1).

I can say whether the groups are good or bad because, at the moment, I have a fairly small amount of data to analyse, so I can check them very easily.

Does what I'm getting make any sense? Is there something that justifies it? Do you have any suggestions on how to cluster with a similarity matrix? Is there a better way to visualize a similarity matrix than a dendrogram?


Solution

  • You can visualize a similarity matrix using a heatmap (for example, using the heatmaply R package). You can check whether a dendrogram fits the data by using the cor_cophenetic function from the dendextend R package (use the most recent version from GitHub).

    Clustering which is based on distance can be done using hclust, but also using cluster::pam (k-medoids).
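As a rough sketch of the suggestions above, the following checks how well an hclust tree fits a dissimilarity matrix and then clusters the same matrix with cluster::pam. To avoid extra dependencies it uses base R's cophenetic() and cor() in place of dendextend's cor_cophenetic, and the toy matrix is an assumption made for illustration.

```r
library(cluster)  # provides pam(); a recommended package shipped with R

# Toy dissimilarity matrix (1 - similarity); values are invented for illustration.
labels <- c("a", "b", "c", "d")
dis <- matrix(c(0.0, 0.1, 0.8, 0.9,
                0.1, 0.0, 0.7, 0.8,
                0.8, 0.7, 0.0, 0.2,
                0.9, 0.8, 0.2, 0.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(labels, labels))
d <- as.dist(dis)

# Hierarchical clustering with average linkage.
hc <- hclust(d, method = "average")

# Cophenetic correlation: correlation between the original dissimilarities
# and the heights at which pairs are joined in the tree. Values near 1
# suggest the dendrogram represents the matrix well.
coph_cor <- cor(cophenetic(hc), d)

# k-medoids on the same dissimilarities; diss = TRUE marks d as a
# dissimilarity object rather than a data matrix.
pam_fit <- pam(d, k = 2, diss = TRUE)
pam_groups <- unname(pam_fit$clustering)
```

For a quick look at the matrix itself without additional packages, base R's heatmap(1 - dis, symm = TRUE) draws the similarity values with dendrograms on the margins; heatmaply offers an interactive version.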