
How to create a dissimilarity matrix with the daisy function in R?


I want to perform a cluster analysis with the pam function in R, using daisy to create the dissimilarity matrix. My data contains 2 columns (ID and Disease). Both are factors with many levels (400 and 1800 respectively). How can I create the dissimilarity matrix I need to cluster the data with pam?

Example data frame:

set.seed(1)
df <- data.frame(ID = rep(sample(c("a","b","c","d","e","f","g"),10,replace = TRUE),70),
                 disease = sample(c("flu","headache","pain","inflammation","depression","infection","chest pain"),100,replace = TRUE))

df <- unique(df)

Can I run the daisy function on this data frame or do I have to convert it into another format?


Solution

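  • daisy compares the rows of its input: "Dissimilarities will be computed between the rows of x" (?daisy). In your data frame each row is a single ID/disease pair, not an ID, so you may want to cross-tabulate the data frame with table() first. That gives one row per ID and one 0/1 column per disease, and daisy can then be run on the table.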

    (df.tab <- table(df))
    #    disease
    # ID  chest pain depression flu headache infection inflammation pain
    #   a          1          1   1        1         1            1    1
    #   b          1          1   1        1         1            1    1
    #   c          1          1   0        0         1            1    1
    #   d          1          1   1        0         1            0    1
    #   e          0          1   1        1         1            1    0
    #   f          0          1   1        1         1            0    1
    #   g          1          1   1        1         1            1    0 
    
    library(cluster)    
    daisy(df.tab, metric="euclidean")
    # Dissimilarities :
    #   a        b        c        d        e        f
    # b 0.000000                                             
    # c 1.414214 1.414214                                    
    # d 1.414214 1.414214 1.414214                           
    # e 1.414214 1.414214 2.000000 2.000000                  
    # f 1.414214 1.414214 2.000000 1.414214 1.414214         
    # g 1.000000 1.000000 1.732051 1.732051 1.000000 1.732051
    # 
    # Metric :  euclidean 
    # Number of objects : 7
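
To finish the step the question asks about, the dissimilarity object can be passed straight to pam. The following is a minimal sketch, not a recommendation for the real data: the matrix m is typed in by hand from the table(df) output above so that the example does not depend on the random seed, and k = 2 is an arbitrary choice made only for illustration.

```r
library(cluster)

# Presence/absence matrix copied from the table(df) output above
# (1 = the ID has the disease). Typed in by hand for reproducibility.
m <- matrix(c(1, 1, 1, 1, 1, 1, 1,
              1, 1, 1, 1, 1, 1, 1,
              1, 1, 0, 0, 1, 1, 1,
              1, 1, 1, 0, 1, 0, 1,
              0, 1, 1, 1, 1, 1, 0,
              0, 1, 1, 1, 1, 0, 1,
              1, 1, 1, 1, 1, 1, 0),
            nrow = 7, byrow = TRUE,
            dimnames = list(ID = letters[1:7],
                            disease = c("chest pain", "depression", "flu",
                                        "headache", "infection",
                                        "inflammation", "pain")))

# One row per ID, so daisy returns ID-by-ID dissimilarities
d <- daisy(m, metric = "euclidean")

# diss = TRUE tells pam that d is already a dissimilarity object;
# k = 2 is an assumption for this sketch, not a tuned value
fit <- pam(d, k = 2, diss = TRUE)

fit$clustering   # cluster label for each ID
fit$medoids      # the IDs chosen as cluster representatives
```

With the real data, the number of clusters would normally be chosen by comparing several values of k, for example by average silhouette width (fit$silinfo$avg.width).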