Clustering using daisy and pam in R


I'm trying to run a fairly straightforward cluster analysis, but I can't get sensible results. The question I want to answer for a large dataset is "Which diseases are frequently reported together?". The simplified sample below should give 2 clusters: 1) headache / dizziness and 2) nausea / abd pain. I'm using the pam and daisy functions from the cluster package. For this example I set the number of clusters manually (k = 2) because I know the desired result, but in reality I explore several values of k.

Does anyone know what I'm doing wrong here?

library(cluster)
library(dplyr)

dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"))


gower_dist <- daisy(dat, metric = "gower")
k <- 2
pam_fit <- pam(gower_dist, diss = TRUE, k)  # performs cluster analysis
pam_results <- dat %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
head(pam_results$the_summary)

Solution

  • The way you pass the dataset to the clustering algorithm does not match your objective. You want to group diseases that are reported together, but the ID column is also included in the dissimilarity matrix, so it contributes to the distances. You do not want that, since your objective concerns only the diseases.

    Hence, we need to build a dataset in which each row is a patient with all the diseases they reported, and then compute the dissimilarity matrix on the numeric features only. To do this, I add a column presence with value 1 for every reported disease; pivot_wider then fills the unreported patient/disease combinations with 0 through its values_fill argument.

    Here is the code I used. I think it produces what you wanted; please let me know if it does.

    library(cluster)
    library(dplyr)
    library(tidyr)
    
    dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                      PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
                      presence = 1)
    # build the wider dataset: each row is a patient
    dat_wider <- pivot_wider(
        dat,
        id_cols = ID,
        names_from = PTName,
        values_from = presence,
        values_fill = list(presence = 0)
    )
    
    # in the dissimilarity matrix construction, we leave out the column ID
    gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
    k <- 2
    
    set.seed(123)
    pam_fit <- pam(gower_dist, diss = TRUE, k) 
    pam_results <- dat_wider %>%
        mutate(cluster = pam_fit$clustering) %>%
        group_by(cluster) %>%
        do(the_summary = summary(.))
    head(pam_results$the_summary)
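
    The summary() output above can be hard to read when the goal is to see which diseases go together. As a minimal sketch (the names cluster_profile and n_patients are mine, not part of the original answer), the same pipeline can instead report the share of patients in each cluster who reported each disease. Note that daisy() may warn that the 0/1 columns are treated as interval scaled.

    ```r
    library(cluster)
    library(dplyr)
    library(tidyr)

    dat <- data.frame(
      ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
      PTName = c("headache","dizziness","nausea","abd pain","dizziness",
                 "headache","abd pain","nausea","headache","dizziness"),
      presence = 1
    )

    # one row per patient, one 0/1 column per disease
    dat_wider <- pivot_wider(dat, id_cols = ID, names_from = PTName,
                             values_from = presence,
                             values_fill = list(presence = 0))

    # dissimilarities on the disease columns only (ID is dropped)
    gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
    pam_fit <- pam(gower_dist, k = 2, diss = TRUE)

    # per cluster: share of patients reporting each disease, then cluster size
    # (n() comes after across() so across() does not pick up the new column)
    cluster_profile <- dat_wider %>%
      mutate(cluster = pam_fit$clustering) %>%
      group_by(cluster) %>%
      summarise(across(-ID, mean), n_patients = n())

    print(cluster_profile)
    ```

    Diseases whose share is close to 1 within the same cluster are the ones that tend to be reported together by that group of patients.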
    

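    Furthermore, since you are working only with binary data, you can consider the Simple Matching or Jaccard distance instead of Gower's distance if they suit your data better. Simple Matching counts joint absences (0/0) as agreement, whereas Jaccard ignores them, which can matter when most patients report only a few of the possible diseases. In R you can compute them with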

    bin_vars <- dat_wider %>% select(-ID)   # 0/1 disease columns only
    p <- ncol(bin_vars)                     # number of binary variables
    sm_dist <- dist(bin_vars, method = "manhattan") / p
    j_dist <- dist(bin_vars, method = "binary")
    

    respectively, where p is the number of binary variables you want to consider.
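
    To see how these two distances plug into pam, here is a small self-contained sketch. The matrix x is a hand-written stand-in for the 0/1 disease columns of the example (an assumption made so the snippet runs on its own), and the names pam_sm and pam_j are mine.

    ```r
    library(cluster)

    # toy 0/1 matrix mirroring the example: rows are patients, columns are diseases
    x <- matrix(c(1, 1, 0, 0,
                  0, 0, 1, 1,
                  1, 1, 0, 0,
                  0, 0, 1, 1,
                  1, 1, 0, 0),
                ncol = 4, byrow = TRUE,
                dimnames = list(paste0("id", 1:5),
                                c("headache", "dizziness", "nausea", "abd pain")))

    p <- ncol(x)                                  # number of binary variables
    sm_dist <- dist(x, method = "manhattan") / p  # simple matching distance
    j_dist  <- dist(x, method = "binary")         # Jaccard distance

    # pam accepts a dist object directly when diss = TRUE
    pam_sm <- pam(sm_dist, k = 2, diss = TRUE)
    pam_j  <- pam(j_dist,  k = 2, diss = TRUE)

    # cross-tabulate the two clusterings to check whether they agree
    print(table(simple_matching = pam_sm$clustering, jaccard = pam_j$clustering))
    ```

    On data this clean both distances should separate the same two groups of patients; on real data with many rarely reported diseases the two can differ, so it is worth comparing them for each k you explore.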