I am having trouble clustering countries using a kind of cultural correlation measure that I already have.
The dataset covers 90 countries: it has 90 rows and 91 columns (90 country columns plus one that identifies the nation in each row). It looks like this:
Nation Ita Fra Ger Esp Eng ...
Ita NA 0.2 0.1 0.6 0.4 ...
Fra 0.2 NA 0.2 0.1 0.3 ...
Ger 0.7 0.1 NA 0.5 0.4
Esp 0.6 0.1 0.5 NA 0.2
Eng 0.4 0.3 0.4 0.2 NA
... .....
...
I am looking for an algorithm that clusters my countries into groups (for instance, groups of 3) or, even better, into more flexible clusters, where neither the number of clusters nor the number of countries per cluster is fixed ex ante,
so that the output is, for instance:
Nation cluster
Ita 1
Fra 2
Ger 3
Esp 1
Eng 3
......
#DATA
df1 = read.table(strip.white = TRUE, stringsAsFactors = FALSE, header = TRUE, text =
"Nation Ita Fra Ger Esp Eng
Ita NA 0.2 0.1 0.6 0.4
Fra 0.2 NA 0.2 0.1 0.3
Ger 0.7 0.1 NA 0.5 0.4
Esp 0.6 0.1 0.5 NA 0.2
Eng 0.4 0.3 0.4 0.2 NA")
# Replace the NA diagonal with 0 so that prcomp() and kmeans() can run
df1 = replace(df1, is.na(df1), 0)
row.names(df1) = df1[,1]
df1 = df1[,-1]
# Run PCA to visualize similarities
pca = prcomp(as.matrix(df1))
pca_m = as.data.frame(pca$x)
plot(pca_m$PC1, pca_m$PC2)
text(x = pca_m$PC1, y = pca_m$PC2, labels = row.names(df1))
# Run k-means; choose the number of centers based on the PCA plot
# (k-means starts from random centers, so cluster labels can differ between runs)
kk = kmeans(x = df1, centers = 3)
kk$cluster
# Ita Fra Ger Esp Eng
# 3 1 2 1 1
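
Since the question asks for a number of clusters that is not fixed in advance, hierarchical clustering is one alternative to k-means worth trying. The sketch below is only an illustration, not part of the approach above. It assumes that higher values mean more similar countries, that values lie in [0, 1] so that 1 - similarity can serve as a distance, and that averaging the two triangles is an acceptable way to handle the asymmetric entries in the sample (Ita/Ger is 0.1 in one direction and 0.7 in the other). The cut height of 0.6 is an arbitrary choice for the toy data.

```r
# Toy similarity matrix copied from the question (5 of the 90 countries)
nations <- c("Ita", "Fra", "Ger", "Esp", "Eng")
sim <- matrix(
  c(NA,  0.2, 0.1, 0.6, 0.4,
    0.2, NA,  0.2, 0.1, 0.3,
    0.7, 0.1, NA,  0.5, 0.4,
    0.6, 0.1, 0.5, NA,  0.2,
    0.4, 0.3, 0.4, 0.2, NA),
  nrow = 5, byrow = TRUE,
  dimnames = list(nations, nations)
)

# Assumption: the matrix is meant to be symmetric, so average both triangles
sim <- (sim + t(sim)) / 2

# Assumption: a country is maximally similar to itself
diag(sim) <- 1

# Assumption: similarity s in [0, 1] converts to distance 1 - s
d <- as.dist(1 - sim)

# Agglomerative clustering with average linkage
hc <- hclust(d, method = "average")
plot(hc)  # inspect the dendrogram to pick a sensible cut height

# Cut by height rather than by a fixed k, so the number of clusters
# and their sizes follow from the data (0.6 is an arbitrary threshold)
clusters <- cutree(hc, h = 0.6)
result <- data.frame(Nation = names(clusters), cluster = unname(clusters))
result
```

Raising or lowering `h` merges or splits groups; `cutree(hc, k = 3)` can be used instead if a fixed number of clusters is wanted after all.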