I want to perform a cluster analysis with the `pam` function in R, using `daisy` to create a dissimilarity matrix. My data contains two columns (ID and Disease). Both are factors with many levels (400 and 1800, respectively). How can I create the dissimilarity matrix I need to cluster the data using `pam`?
Example data frame:
set.seed(1)
df <- data.frame(ID = rep(sample(c("a","b","c","d","e","f","g"),10,replace = TRUE),70),
disease = sample(c("flu","headache","pain","inflammation","depression","infection","chest pain"),100,replace = TRUE))
df <- unique(df)
Can I run the `daisy` function on this data frame, or do I have to convert it into another format?
Since "Dissimilarities will be computed between the rows of x" (`?daisy`), you may want to run `daisy` on the `table` of your data frame, which has one row per ID and one column per disease.
(df.tab <- table(df))
#    disease
# ID  chest pain depression flu headache infection inflammation pain
#   a          1          1   1        1         1            1    1
#   b          1          1   1        1         1            1    1
#   c          1          1   0        0         1            1    1
#   d          1          1   1        0         1            0    1
#   e          0          1   1        1         1            1    0
#   f          0          1   1        1         1            0    1
#   g          1          1   1        1         1            1    0
library(cluster)
daisy(df.tab, metric="euclidean")
# Dissimilarities :
#          a        b        c        d        e        f
# b 0.000000
# c 1.414214 1.414214
# d 1.414214 1.414214 1.414214
# e 1.414214 1.414214 2.000000 2.000000
# f 1.414214 1.414214 2.000000 1.414214 1.414214
# g 1.000000 1.000000 1.732051 1.732051 1.000000 1.732051
#
# Metric :  euclidean
# Number of objects : 7
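
To finish the workflow from the question, the dissimilarity object can be passed straight to `pam`. The following is only a sketch. It rebuilds the presence/absence matrix by hand from the `table(df)` output above so that it runs on its own. The names `m`, `d`, `fit`, `d_bin` and `avg_sil`, the choice `k = 2`, and the range `2:4` are illustrative assumptions, not recommendations for the real data.

```r
library(cluster)

# Presence/absence matrix copied from the table(df) output above
# (rows = IDs, columns = diseases).
m <- matrix(
  c(1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1,
    1, 1, 0, 0, 1, 1, 1,
    1, 1, 1, 0, 1, 0, 1,
    0, 1, 1, 1, 1, 1, 0,
    0, 1, 1, 1, 1, 0, 1,
    1, 1, 1, 1, 1, 1, 0),
  nrow = 7, byrow = TRUE,
  dimnames = list(
    ID = letters[1:7],
    disease = c("chest pain", "depression", "flu", "headache",
                "infection", "inflammation", "pain")
  )
)

# Dissimilarities between IDs (rows), as in the answer.
d <- daisy(m, metric = "euclidean")

# Cluster the IDs; k = 2 is an arbitrary value chosen for illustration.
fit <- pam(d, k = 2, diss = TRUE)
fit$clustering  # cluster label for each ID
fit$medoids     # the IDs chosen as medoids

# Assumption: for 0/1 data a Jaccard-type distance may suit better;
# stats::dist(method = "binary") provides one, and pam() accepts it too.
d_bin <- dist(m, method = "binary")

# One possible way to compare candidate k: average silhouette width.
avg_sil <- sapply(2:4, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)
```

With the real data, `table(df)` would give a 400 x 1800 matrix, so the same calls apply; only the choice of `k` and of the distance measure needs more thought.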