
Error and lack of names in distance matrix, dist()


I have a problem with hierarchical clustering using the Minkowski distance. This is my code:

install.packages("eurostat")
install.packages("ggplot2")
install.packages("dplyr")
library(eurostat)
library(ggplot2)
library(dplyr)


data <- get_eurostat("prc_hicp_manr")

unique(data$geo)

data <- filter(data, time >= "2000-03-01" & time <= "2022-09-30" & coicop == "CP00")


country_name <- eurostat::eu_countries


data <- merge(data, country_name, by.x = "geo", by.y = "code")

unique(data$name)

data <- filter(data, name != "" & name != "United Kingdom")

unique(data$name)

sum(is.na(data$name)) # 0 NAs, so the data contain only EU-27 countries



ggplot(data) +
  aes(x = time, y = values, colour = name) +
  geom_line() +
  scale_color_hue(direction = 1) +
  labs(title = "HICP trajectories for EU countries") +
  theme_bw() +
  theme(plot.title = element_text(size = 20L, hjust = 0.5))

data_by_country <- data %>%
  group_by(name) %>%
  summarize(HICP = mean(values)) %>%
  ungroup() %>%
  as.matrix()

any(is.na(data_by_country))

data_by_country <- na.omit(data_by_country)

distance_matrix <- as.dist(dist(data_by_country, p = 1.5, method = "minkowski"))

Everything works up to the data_by_country step. Running it gives a clean table with two columns: the country names and their mean HICP. There are no NAs in this table. These are its first 6 rows:

      name     HICP
1  Austria 2.102952
2  Belgium 2.232472
3 Bulgaria 4.036531
4  Croatia 2.410332
5   Cyprus 1.833579
6  Czechia 2.621033

After that, I want to create a distance matrix from that table. Unfortunately, I get this warning:

Warning message:
In dist(data_by_country, p = 1.5, method = "minkowski") :
  NAs introduced by coercion

But there are no NAs in my table. Additionally, when I run:

distance_matrix <- as.dist(dist(data_by_country, p = 1.5, method = "minkowski"))

I get a matrix with values, but the rows and columns have no names, only numbers.

Do you have any idea what is wrong with this code? I am new to R, so the fix may be simple, but I have no idea what to do next.

IMPORTANT: I can only use the dplyr, ggplot2 and eurostat packages in this exercise.

This is the plot I want to get. Is this the right way to do it?


Solution

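  • Add row names to your data frame, keep only the numeric column, and pass drop = FALSE so the result stays a one-column data frame instead of collapsing to a vector. This needs data_by_country to be a data frame, so remove the as.matrix() call from your pipeline: as.matrix() on a table with a character column returns a character matrix, and dist() then turns the country names into NA, which is where the coercion warning comes from.

    data_by_country <- as.data.frame(data_by_country)  # plain data frame, so row names are kept
    rownames(data_by_country) <- data_by_country$name
    
    plot(hclust(dist(data_by_country[, "HICP", drop = FALSE])))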
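To see where the warning and the missing labels come from, here is a minimal base-R sketch that uses only the six rows quoted in the question. The object names (hicp, m, d_bad, d_ok, hc) are made up for illustration, and suppressWarnings() is there only to keep the demonstration quiet.

```r
# First six rows of the summary table shown in the question
hicp <- data.frame(
  name = c("Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia"),
  HICP = c(2.102952, 2.232472, 4.036531, 2.410332, 1.833579, 2.621033),
  stringsAsFactors = FALSE
)

# A matrix holds a single type, so mixing a character column with a
# numeric one turns every cell into a character string.
m <- as.matrix(hicp)
is.character(m)  # TRUE

# dist() coerces the matrix back to numbers; the country names cannot be
# parsed, so they become NA ("NAs introduced by coercion").
d_bad <- suppressWarnings(dist(m, method = "minkowski", p = 1.5))

# Fix: keep the names as row names and pass only the numeric column.
rownames(hicp) <- hicp$name
d_ok <- dist(hicp[, "HICP", drop = FALSE], method = "minkowski", p = 1.5)

# The labels now travel with the distance object into hclust()
attr(d_ok, "Labels")
hc <- hclust(d_ok)
hc$labels
```

Calling plot(hc) on the result should then draw the dendrogram with country names on the leaves.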
    

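The answer calls dist() with its default Euclidean metric, while the question asks for Minkowski with p = 1.5. With a single numeric column this should not change anything, because (|a - b|^p)^(1/p) reduces to |a - b| for any p. The sketch below checks that on the six values quoted in the question; the variable names are illustrative only.

```r
# One-column input built from the six HICP means quoted in the question
x <- matrix(
  c(2.102952, 2.232472, 4.036531, 2.410332, 1.833579, 2.621033),
  ncol = 1,
  dimnames = list(
    c("Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia"),
    "HICP"
  )
)

d_minkowski <- dist(x, method = "minkowski", p = 1.5)
d_euclidean <- dist(x)                       # default method = "euclidean"
d_manhattan <- dist(x, method = "manhattan")

# Compare the raw distances only; the dist objects themselves differ in
# their "method" and "call" attributes.
same_as_euclidean <- isTRUE(all.equal(as.vector(d_minkowski), as.vector(d_euclidean)))
same_as_manhattan <- isTRUE(all.equal(as.vector(d_minkowski), as.vector(d_manhattan)))
same_as_euclidean
same_as_manhattan
```

If more numeric columns are added later, the metrics will generally differ, and method = "minkowski", p = 1.5 would need to be passed to dist() explicitly.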