I'm computing group centroids both in PCA space and in the original 'feature space', for data spanning ~20 treatments and 3 groups. If I understood my maths teacher correctly, the distance between two centroids should be identical in both spaces. However, the way I calculate them, they're not, and I'm wondering whether one of my two calculations is wrong.
I use the notorious wine dataset as an illustration for my method/MWE:
library(ggbiplot)
data(wine)
treatments <- 1:2 # treatments to be considered for this calculation
wine.pca <- prcomp(wine[treatments], scale. = TRUE)
#calculate the centroids for the feature/treatment space and the pca space
df.wine.x <- as.data.frame(wine.pca$x)
df.wine.x$groups <- wine.class
wine$groups <- wine.class
feature.centroids <- aggregate(wine[treatments], list(Type = wine$groups), mean)
pca.centroids <- aggregate(df.wine.x[treatments], list(Type = df.wine.x$groups), mean)
pca.centroids
feature.centroids
#calculate distance between the centroids of barolo and grignolino
dist(rbind(feature.centroids[feature.centroids$Type == "barolo",][-1],feature.centroids[feature.centroids$Type == "grignolino",][-1]), method = "euclidean")
dist(rbind(pca.centroids[pca.centroids$Type == "barolo",][-1],pca.centroids[pca.centroids$Type == "grignolino",][-1]), method = "euclidean")
The last two lines return 1.468087 for the distance in the feature space and 1.80717 in the PCA space, indicating there's a fly in the ointment...
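It's because of scaling. With scale. = TRUE, prcomp divides each column by its standard deviation before rotating, so distances in PCA space correspond to the standardized data rather than the raw data. Centering on its own is only a shift and does not change the distance between two centroids. If you turn scaling and centering off, the distance is exactly the same in the original feature space and in PCA space.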
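A short derivation sketch may make this clearer. The notation is my own, not from the question: $\mu$ is the vector of column means, $D$ the diagonal matrix of column standard deviations, $W$ the rotation matrix from prcomp (assuming all components are kept, so $W$ is square and orthogonal), and $\bar{x}_a$, $\bar{x}_b$ are two group centroids in feature space with PCA-space counterparts $\bar{z}_a$, $\bar{z}_b$.

$$
z = W^\top D^{-1}(x - \mu)
\quad\Longrightarrow\quad
\bar{z}_a - \bar{z}_b = W^\top D^{-1}(\bar{x}_a - \bar{x}_b)
$$

$$
\lVert \bar{z}_a - \bar{z}_b \rVert = \lVert D^{-1}(\bar{x}_a - \bar{x}_b) \rVert
$$

The mean $\mu$ cancels in the difference, and the orthogonal $W$ does not change lengths, so only $D$ remains. Without scaling, $D = I$ and the two distances coincide. With scaling, the PCA-space distance equals the distance between centroids of the standardized data. If only some of the components are kept, the PCA-space distance can only be smaller than or equal to that value.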
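wine.pca <- prcomp(wine[treatments], scale. = FALSE, center = FALSE)
# recompute the PCA-space centroids from the new scores
df.wine.x <- as.data.frame(wine.pca$x)
df.wine.x$groups <- wine.class
pca.centroids <- aggregate(df.wine.x[treatments], list(Type = df.wine.x$groups), mean)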
dist(rbind(feature.centroids[feature.centroids$Type == "barolo",][-1],feature.centroids[feature.centroids$Type == "grignolino",][-1]), method = "euclidean")
# 1
# 2 1.468087
dist(rbind(pca.centroids[pca.centroids$Type == "barolo",][-1],pca.centroids[pca.centroids$Type == "grignolino",][-1]), method = "euclidean")
# 1
# 2 1.468087
Another way to get matching distances is to scale and center the original data first and then apply PCA with scaling and centering, like the following:
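wine[treatments] <- scale(wine[treatments], center = TRUE)
wine.pca <- prcomp(wine[treatments], scale. = TRUE)
# recompute both sets of centroids from the scaled data and the new scores
feature.centroids <- aggregate(wine[treatments], list(Type = wine$groups), mean)
df.wine.x <- as.data.frame(wine.pca$x)
df.wine.x$groups <- wine.class
pca.centroids <- aggregate(df.wine.x[treatments], list(Type = df.wine.x$groups), mean)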
dist(rbind(feature.centroids[feature.centroids$Type == "barolo",][-1],feature.centroids[feature.centroids$Type == "grignolino",][-1]), method = "euclidean")
# 1
# 2 1.80717
dist(rbind(pca.centroids[pca.centroids$Type == "barolo",][-1],pca.centroids[pca.centroids$Type == "grignolino",][-1]), method = "euclidean")
# 1
# 2 1.80717
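Both cases can be checked in a single self-contained run. The sketch below is only an illustration under a few assumptions: it swaps in R's built-in iris data so that it runs without ggbiplot, it keeps all principal components, and centroid_dist is a hypothetical helper name, not something from the code above.

```r
# Assumption: built-in iris data stands in for the wine data from ggbiplot.
X <- iris[, 1:2]
groups <- iris$Species

# Hypothetical helper: Euclidean distance between the centroids of groups a and b.
centroid_dist <- function(data, groups, a, b) {
  data <- as.matrix(data)
  ca <- colMeans(data[groups == a, , drop = FALSE])
  cb <- colMeans(data[groups == b, , drop = FALSE])
  sqrt(sum((ca - cb)^2))
}

# Feature space: raw data and standardized data.
d_raw    <- centroid_dist(X, groups, "setosa", "versicolor")
d_scaled <- centroid_dist(scale(X), groups, "setosa", "versicolor")

# PCA space: centering only, and centering plus scaling.
pca_centered <- prcomp(X, center = TRUE, scale. = FALSE)
pca_scaled   <- prcomp(X, center = TRUE, scale. = TRUE)
d_pca_centered <- centroid_dist(pca_centered$x, groups, "setosa", "versicolor")
d_pca_scaled   <- centroid_dist(pca_scaled$x, groups, "setosa", "versicolor")

# Each PCA-space distance should match its feature-space counterpart.
cat("raw:", d_raw, " PCA without scaling:", d_pca_centered, "\n")
cat("standardized:", d_scaled, " PCA with scaling:", d_pca_scaled, "\n")
```

Centering is left on in the unscaled case here to show that the shift alone does not affect the distance; only the scaling step does.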