Tags: r, cluster-analysis, principal-components

Extracting useful information from K-Means on Principal Components


I am working with a fairly large data set (I use only about 1/32 of it, and even that subset is roughly 50000 x 9000). To analyze it, I have taken several steps to reduce the dimensionality so that I can then apply a clustering algorithm.

Take a look at the following data frame:

set.seed(340)
df = data.frame(replicate(10,sample(0:10,size = 10,replace = TRUE)))
> df
   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1   4  9  4  6  9  4  2  5  8   8
2   5  8  2  0  4  6  1  1  0  10
3   1  7  6  3  5  9  6  0  7   1
4   0  6  8  6  6  0  5  5 10  10
5   2  0  5  8  2 10  8  2  1   5
6   3  9 10  2  8  5  2 10  3  10
7   9  0  1  0  6  8  9  6  5   0
8   5  6  9  3 10  4  4  8  6   9
9   8  7  6  2 10  9  9  7  1  10
10  0  7  2  6  1  6  3  2  3   9

Each row represents a person, and each variable says how often that person exhibited a given quality. Say I perform principal component analysis on this using princomp() and collect the first four PCs to use for k-means.

pc = princomp(df)
new_df = cbind(pc$loadings[,1],pc$loadings[,2],pc$loadings[,3],pc$loadings[,4])
fit = kmeans(new_df,2)

From this I can deduce which cluster exhibits high values of which principal components, and I can use the loadings to see what each principal component is a general measure of. However, I would ultimately like to connect this information to my original data set. Is there a way to assign each person in the original data to a cluster created by k-means on the principal components? Or am I misunderstanding the concept of PCA?


Solution

  • pc$loadings contains the coordinates of the input variables, not those of the individuals. So kmeans(new_df,2) clusters variables, not individuals. To check this, try your code with a 10x5 data.frame instead of a 10x10 one: you get only 5 cluster assignments, one per variable:

    df = data.frame(replicate(5,sample(0:10,size = 10,replace = TRUE)))
    pc = princomp(df)
    new_df = cbind(pc$loadings[,1],pc$loadings[,2],pc$loadings[,3],pc$loadings[,4])
    fit = kmeans(new_df,2)
    fit$cluster
    X1 X2 X3 X4 X5 
     2  2  1  2  2 
    
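    A quick way to see the same thing without running kmeans is to compare shapes. This is a minimal sketch that assumes the 10x5 example above, with a seed added only for reproducibility: the matrix built from the loadings has one row per variable, so its row names are the column names of df.

    ```r
    set.seed(340)  # seed is an assumption, added so the sketch is repeatable
    df <- data.frame(replicate(5, sample(0:10, size = 10, replace = TRUE)))
    pc <- princomp(df)

    # First four columns of the loadings, equivalent to the cbind() call above
    new_df <- pc$loadings[, 1:4]

    dim(df)           # 10 individuals (rows) by 5 variables (columns)
    dim(new_df)       # 5 rows (one per variable) by 4 components
    rownames(new_df)  # variable names, not individuals
    ```

    Anything clustered on new_df is therefore a grouping of X1 to X5, whatever the number of individuals.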

    If that is what you want to do, you can rbind fit$cluster to your original data.frame; the appended row then holds the cluster of each variable.

    df <- rbind(df,fit$cluster)
    

    However, if you intended to cluster individuals, i.e. the rows of your original data.frame, you need to run the clustering on the row coordinates (scores) produced by the principal component analysis. princomp() returns these as pc$scores, and other PCA implementations expose them just as easily: FactoMineR::PCA outputs a list with row coordinates ($ind$coord) and column coordinates ($var$coord).

    library(FactoMineR)
    pf <- PCA(df,graph=FALSE)
    
    fit <- kmeans(pf$ind$coord[,1:4],2)
    
    fit$cluster
     1  2  3  4  5  6  7  8  9 10 
     1  2  1  1  1  2  1  1  1  2 
    

    To add those to your original data.frame:

    df$cluster <- fit$cluster
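
For what it's worth, base R can do the same without an extra package: princomp() keeps the row coordinates in pc$scores (prcomp() calls them $x). The following is a sketch under a few assumptions, not a definitive recipe: the seed, nstart = 10, and the names scores and profile are illustrative choices that do not come from the question. Note also that princomp() does not standardize the variables by default, so the clusters need not match the FactoMineR result above.

```r
set.seed(340)  # seed is an assumption, for reproducibility only
df <- data.frame(replicate(5, sample(0:10, size = 10, replace = TRUE)))

pc <- princomp(df)

# Row coordinates: one row per individual, one column per component
scores <- pc$scores[, 1:4]

# Cluster the individuals on their first four component scores
fit <- kmeans(scores, centers = 2, nstart = 10)

# Attach the cluster label to each person in the original data
df$cluster <- fit$cluster

# Describe each cluster by the mean of the original variables
profile <- aggregate(. ~ cluster, data = df, FUN = mean)
profile
```

The last step is one way to tie the clusters back to the original qualities: each row of profile is a cluster, and each column shows the average of that variable within it.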