Extracting useful information from K-Means on Principal Components

I am working with a relatively big data set (only using about 1/32 of it, but this subset is approx. 50000x9000). In order to perform analysis on this, I have taken several steps to reduce the dimensionality, so that I can then apply some sort of clustering algorithm.

Take a look at the following data frame:

set.seed(340)
df = data.frame(replicate(10,sample(0:10,size = 10,replace = TRUE)))
> df
   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1   4  9  4  6  9  4  2  5  8   8
2   5  8  2  0  4  6  1  1  0  10
3   1  7  6  3  5  9  6  0  7   1
4   0  6  8  6  6  0  5  5 10  10
5   2  0  5  8  2 10  8  2  1   5
6   3  9 10  2  8  5  2 10  3  10
7   9  0  1  0  6  8  9  6  5   0
8   5  6  9  3 10  4  4  8  6   9
9   8  7  6  2 10  9  9  7  1  10
10  0  7  2  6  1  6  3  2  3   9

Each row represents a person, and each variable says how often that person exhibited that quality. Say I perform principal component analysis on this using princomp(), and collect the first four principal components to use for k-means.

pc = princomp(df)
new_df = cbind(pc$loadings[,1],pc$loadings[,2],pc$loadings[,3],pc$loadings[,4])
fit = kmeans(new_df,2)

From this I can deduce which cluster exhibits high values of which principal components, and I can use the loadings to see what each principal component is a general measure of. However, I would ultimately like to connect this information to my original data set. Is there a way to assign each person in the original data to a cluster created from the k-means on the principal components? Or am I misunderstanding the concept of PCA?

Solution

pc$loadings holds the coordinates of the input variables, not those of the individuals, so kmeans(new_df,2) clusters variables rather than individuals. To check this, run your code on a 10x5 data.frame instead of a 10x10 one: you get only 5 cluster assignments, one per variable:

df = data.frame(replicate(5,sample(0:10,size = 10,replace = TRUE)))
pc = princomp(df)
new_df = cbind(pc$loadings[,1],pc$loadings[,2],pc$loadings[,3],pc$loadings[,4])
fit = kmeans(new_df,2)
fit$cluster
X1 X2 X3 X4 X5 
 2  2  1  2  2

If that is what you want to do, you can just rbind fit$cluster to your original data.frame; the appended row then holds the cluster label of each variable.

df <- rbind(df,fit$cluster)
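A quick way to see which object describes variables and which describes individuals is to compare dimensions. This is a minimal sketch on freshly simulated 10x5 data (the seed is only there for reproducibility): the loadings have one row per variable, while the scores stored by princomp() have one row per individual.

```r
# Simulated data: 10 individuals (rows), 5 variables (columns)
set.seed(340)
df <- data.frame(replicate(5, sample(0:10, size = 10, replace = TRUE)))

pc <- princomp(df)

# Loadings: one row per variable, one column per component
dim(pc$loadings)   # 5 5

# Scores: one row per individual, one column per component
dim(pc$scores)     # 10 5
```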

If you instead intended to cluster individuals, i.e. the rows of your original data.frame, you need to run the clustering on the row coordinates (scores) produced by the principal component analysis. In princomp these are stored in pc$scores; other PCA implementations expose them too. For example, FactoMineR::PCA returns a list with row coordinates ($ind$coord) and column coordinates ($var$coord).

library(FactoMineR)
pf <- PCA(df,graph=FALSE)

fit <- kmeans(pf$ind$coord[,1:4],2)

fit$cluster
 1  2  3  4  5  6  7  8  9 10 
 1  2  1  1  1  2  1  1  1  2

To add those to your original data.frame:

df$cluster <- fit$cluster
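The same idea can be sketched with base R only, using the scores that princomp() stores in pc$scores. This is an illustration under assumptions, not a drop-in replacement for the FactoMineR code: the data are re-simulated, nstart = 25 and the seed are arbitrary choices, and the assignments need not match the FactoMineR result, because PCA() scales variables to unit variance by default while princomp() works on the covariance matrix unless cor = TRUE.

```r
# Simulated data: 10 individuals (rows), 10 variables (columns)
set.seed(340)
df <- data.frame(replicate(10, sample(0:10, size = 10, replace = TRUE)))

# PCA on the covariance matrix; pc$scores has one row per individual
pc <- princomp(df)
scores <- pc$scores[, 1:4]

# Cluster individuals on their first four component scores
# (nstart = 25 is an arbitrary choice to stabilise the result)
fit <- kmeans(scores, centers = 2, nstart = 25)

# Attach each individual's cluster to the original data
df$cluster <- fit$cluster

# Mean of every original variable within each cluster
aggregate(. ~ cluster, data = df, FUN = mean)
```

The last line is one way to connect the clusters back to the original variables: it shows which qualities are, on average, more frequent in each cluster.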