r, algorithm, machine-learning, cluster-analysis, hierarchical-clustering

Clustering a set of countries based on cultural similarity in R


I am having trouble clustering countries using a measure of cultural correlation that I already have.

The dataset covers 90 countries. It has 90 rows and 91 columns: one column per country plus one column that identifies the nation on each row. It looks like this:

 Nation Ita   Fra   Ger   Esp   Eng  ...
 Ita    NA    0.2   0.1   0.6   0.4  ...
 Fra    0.2   NA    0.2   0.1   0.3  ...
 Ger    0.7   0.1   NA    0.5   0.4
 Esp    0.6   0.1   0.5   NA    0.2
 Eng    0.4   0.3   0.4   0.2   NA
 ...                              .....
 ...

I am looking for an algorithm that clusters my countries into groups: for instance groups of 3 or, even better, more flexible clusters, where neither the number of clusters nor the number of countries per cluster is fixed ex ante.

The output would then look like this, for instance:

  Nation   cluster
  Ita       1
  Fra       2
  Ger       3
  Esp       1
  Eng       3
  ......

Solution

  • #DATA
    df1 = read.table(strip.white = TRUE, stringsAsFactors = FALSE, header = TRUE, text =
    "Nation Ita   Fra   Ger   Esp   Eng
     Ita    NA    0.2   0.1   0.6   0.4
     Fra    0.2   NA    0.2   0.1   0.3
     Ger    0.7   0.1   NA    0.5   0.4
     Esp    0.6   0.1   0.5   NA    0.2
     Eng    0.4   0.3   0.4   0.2   NA")
    
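    # The diagonal (self-similarity) is NA; set it to 0 so prcomp() and kmeans() accept the data
    df1 = replace(df1, is.na(df1), 0)
    # Move the nation names into the row names and drop the Nation column
    row.names(df1) = df1[,1]
    df1 = df1[,-1]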
    
    # Run PCA to visualize similarities (each country's row is treated as its feature vector)
    pca = prcomp(as.matrix(df1))
    pca_m = as.data.frame(pca$x)
    plot(pca_m$PC1, pca_m$PC2)
    text(x = pca_m$PC1, y = pca_m$PC2, labels = row.names(df1))
    


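    # Run k-means; the number of centers (3) is chosen by eye from the PCA plot
    kk = kmeans(x = df1, centers = 3)
    # Cluster label per country; starting centers are random, so labels can differ between runs
    kk$cluster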
    # Ita Fra Ger Esp Eng 
    #   3   1   2   1   1
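
Since the question asks for a number of clusters that is not fixed in advance, hierarchical clustering may be worth trying as an alternative to k-means. The following is only a sketch and is not part of the answer above. It rests on several assumptions: the values are similarities in [0, 1], so `1 - similarity` can serve as a dissimilarity; the two directions of each pair are averaged, because the sample is not symmetric (Ita-Ger is 0.1 but Ger-Ita is 0.7); average linkage is used; and the cut height `h = 0.6` is an arbitrary illustrative threshold. The names `sim_df`, `sim`, `dis`, `hc`, `cl` and `result` are made up for this example.

```r
# Same toy similarity table as in the question (higher = more similar).
sim_df <- read.table(strip.white = TRUE, stringsAsFactors = FALSE, header = TRUE, text =
"Nation Ita   Fra   Ger   Esp   Eng
 Ita    NA    0.2   0.1   0.6   0.4
 Fra    0.2   NA    0.2   0.1   0.3
 Ger    0.7   0.1   NA    0.5   0.4
 Esp    0.6   0.1   0.5   NA    0.2
 Eng    0.4   0.3   0.4   0.2   NA")

# Numeric matrix with nation names as row names.
sim <- as.matrix(sim_df[, -1])
rownames(sim) <- sim_df$Nation

# Assumption: average both directions so the matrix becomes symmetric.
sim <- (sim + t(sim)) / 2

# Assumption: similarities lie in [0, 1], so 1 - similarity is a dissimilarity.
dis <- 1 - sim
diag(dis) <- 0  # a country is at distance 0 from itself

# Agglomerative clustering on the dissimilarities (average linkage is a choice, not a given).
hc <- hclust(as.dist(dis), method = "average")

# Cut the tree at a height instead of asking for a fixed number of clusters.
cl <- cutree(hc, h = 0.6)

# Nation / cluster table in the shape requested in the question.
result <- data.frame(Nation = names(cl), cluster = unname(cl))
print(result)
```

Calling `plot(hc)` draws the dendrogram, which can help when picking the cut height; a lower `h` gives more and smaller clusters, a higher `h` gives fewer and larger ones.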