
How to replace dendrogram labels using only base R and/or ggplot2 packages?


I want to draw a dendrogram from hierarchical clustering with the Minkowski distance on my dataset from the eurostat package. I want the labels shown in this dendrogram:

to display country names, like in this one:

I can only use base R packages and/or ggplot2 due to the project's requirements.

Use this code to recreate my situation:

install.packages("eurostat")
install.packages("dplyr")
install.packages("ggplot2")
library(eurostat)
library(dplyr)
library(ggplot2)

member_states <- c("AT", "BE", "BG", "HR", "CY", "CZ",
                        "DK", "EE", "FI", "FR", "DE", "GR", 
                        "HU", "IE", "IT", "LV", "LT", "LU", 
                        "MT", "NL", "PL", "PT", "RO", "SK", 
                        "SI", "ES", "SE", "EL")

hicp <- get_eurostat("prc_hicp_manr", time_format = "date")

hicp_filtered <- hicp %>% filter(time >= as.Date("2000-02-01")
                               & time <= as.Date("2022-09-01")) %>%
                          filter(coicop == "CP00") %>%
                          filter(geo %in% member_states) %>%
                          mutate(geo = case_when(
                            geo == "AT" ~ "Austria",
                            geo == "BE" ~ "Belgium",
                            geo == "BG" ~ "Bulgaria",
                            geo == "HR" ~ "Croatia",
                            geo == "CY" ~ "Cyprus",
                            geo == "CZ" ~ "Czech Republic",
                            geo == "DK" ~ "Denmark",
                            geo == "EE" ~ "Estonia",
                            geo == "FI" ~ "Finland",
                            geo == "FR" ~ "France",
                            geo == "DE" ~ "Germany",
                            geo == "GR" ~ "Greece",
                            geo == "HU" ~ "Hungary",
                            geo == "IE" ~ "Ireland",
                            geo == "IT" ~ "Italy",
                            geo == "LV" ~ "Latvia",
                            geo == "LT" ~ "Lithuania",
                            geo == "LU" ~ "Luxembourg",
                            geo == "MT" ~ "Malta",
                            geo == "NL" ~ "Netherlands",
                            geo == "PL" ~ "Poland",
                            geo == "PT" ~ "Portugal",
                            geo == "RO" ~ "Romania",
                            geo == "SK" ~ "Slovakia",
                            geo == "SI" ~ "Slovenia",
                            geo == "ES" ~ "Spain",
                            geo == "SE" ~ "Sweden",
                            geo == "EL" ~ "Greece",
                            TRUE ~ geo
                          ))

data <- hicp_filtered[, c(3,4,5)]

data_widened <- reshape(transform(data, 
                        id = ave(seq_along(geo), geo, FUN = seq_along)), 
                        idvar = c("id", "time"), 
                        direction = "wide", timevar = "geo")

To perform the cluster analysis, I tried this code:

distance_matrix <- dist(data_widened[3:29, ], method = "minkowski", p = 1.5)
hc <- hclust(distance_matrix, method = "ward.D2")
plot(hc)

How can I replace those weird values with country names and align the clusters on my plot to look like the desired form?

Thanks in advance.


Solution

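  • You have got the row and column indices the wrong way round: data_widened[3:29, ] selects rows 3 to 29, but the countries are in columns 3 to 29. You also need to transpose the data, because dist() computes distances between rows, so each country has to be a row. The labels on the plot are then taken from the row names, that is, the country names.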

    # Remove "values." from the names of each column
    names(data_widened) <- gsub("values\\.", "", names(data_widened))
    
    distance_matrix <- dist(t(data_widened[,3:29]), method = "minkowski", p = 1.5)
    hc <- hclust(distance_matrix, method = "ward.D2")
    plot(hc)
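
To see why the transpose matters, here is a minimal sketch on made-up data. The `toy` data frame and its values are hypothetical (not Eurostat figures); it only mimics the shape of `data_widened`, with one column per country. Passing `hang = -1` to `plot()` is one way to place all labels on a common baseline, which may be what the "aligned" look in the desired plot needs.

```r
# Hypothetical toy data: rows are time points, columns are countries
toy <- data.frame(
  time    = 1:6,
  Austria = c(1.0, 1.2, 1.1, 1.3, 1.4, 1.2),
  Belgium = c(1.1, 1.3, 1.0, 1.2, 1.5, 1.3),
  Poland  = c(3.0, 3.4, 2.9, 3.8, 4.1, 3.6),
  Hungary = c(3.2, 3.1, 3.0, 3.9, 4.0, 3.7)
)

# dist() compares rows, so transpose to get one row per country;
# t() turns the column names into row names
m <- t(toy[, -1])

d  <- dist(m, method = "minkowski", p = 1.5)
hc <- hclust(d, method = "ward.D2")

# hclust() picks up the row names as leaf labels
print(hc$labels)

# hang = -1 draws every label at the same baseline
plot(hc, hang = -1, main = "Toy dendrogram", xlab = "", sub = "")
```

Without the transpose, `dist(toy[, -1])` would compare the six time points with each other, and the leaves would be labelled with row numbers instead of country names.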
    

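If the plot has to be drawn with ggplot2 rather than base graphics, one possible approach is to compute the line segments of the tree from the `hclust` object yourself and draw them with `geom_segment()`. The sketch below is an assumption-laden illustration, not part of the original answer: `hclust_segments()` is a hypothetical helper written for this example, and the matrix `m` holds made-up values with one row per country.

```r
# Hypothetical toy matrix: one row per country (made-up values)
m <- rbind(
  Austria = c(1.0, 1.2, 1.1, 1.3),
  Belgium = c(1.1, 1.3, 1.0, 1.2),
  Poland  = c(3.0, 3.4, 2.9, 3.8),
  Hungary = c(3.2, 3.1, 3.0, 3.9)
)
hc <- hclust(dist(m, method = "minkowski", p = 1.5), method = "ward.D2")

# Hypothetical helper: convert an hclust object into line segments
hclust_segments <- function(hc) {
  n      <- length(hc$order)
  leaf_x <- match(seq_len(n), hc$order)  # x position of each leaf
  node_x <- numeric(n - 1)               # x position of each merge
  segs   <- vector("list", n - 1)
  for (j in seq_len(n - 1)) {
    # In hc$merge, negative entries are leaves and positive entries
    # refer to earlier merge steps
    xy <- sapply(hc$merge[j, ], function(k) {
      if (k < 0) c(leaf_x[-k], 0) else c(node_x[k], hc$height[k])
    })
    h <- hc$height[j]
    node_x[j] <- mean(xy[1, ])
    # Two vertical lines up to height h, plus the horizontal bar
    segs[[j]] <- data.frame(
      x    = c(xy[1, 1], xy[1, 2], xy[1, 1]),
      y    = c(xy[2, 1], xy[2, 2], h),
      xend = c(xy[1, 1], xy[1, 2], xy[1, 2]),
      yend = c(h, h, h)
    )
  }
  do.call(rbind, segs)
}

segs <- hclust_segments(hc)

# Draw only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(segs) +
    geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
    scale_x_continuous(breaks = seq_along(hc$order),
                       labels = hc$labels[hc$order]) +
    labs(x = NULL, y = "Height") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
  print(p)
}
```

Every leaf starts at height 0 here, so the country names line up along the axis, similar to `plot(hc, hang = -1)` in base R.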