Centroid distance calculation in PCA space and in 'feature-space' diverge


I'm computing group centroids both in PCA space and in the original 'feature space', for data spanning ~20 treatments and 3 groups. If I understood my maths teacher correctly, the distance between two centroids should be the same in both spaces. The way I calculate them, however, they are not, and I'm wondering whether one of my two calculations is wrong.

I use the well-known wine dataset to illustrate my method (MWE):

library(ggbiplot)
data(wine)
treatments <- 1:2 #treatments to be considered for this calculation
wine.pca <- prcomp(wine[treatments], scale. = TRUE)
#calculate the centroids for the feature/treatment space and the pca space
df.wine.x <- as.data.frame(wine.pca$x)
df.wine.x$groups <- wine.class
wine$groups <- wine.class
feature.centroids <- aggregate(wine[treatments], list(Type = wine$groups), mean)
pca.centroids <- aggregate(df.wine.x[treatments], list(Type = df.wine.x$groups), mean)
pca.centroids
feature.centroids
#calculate distance between the centroids of barolo and grignolino
dist(rbind(feature.centroids[feature.centroids$Type == "barolo",][-1],feature.centroids[feature.centroids$Type == "grignolino",][-1]), method = "euclidean")
dist(rbind(pca.centroids[pca.centroids$Type == "barolo",][-1],pca.centroids[pca.centroids$Type == "grignolino",][-1]), method = "euclidean")

The last two lines return 1.468087 for the distance in the feature-space and 1.80717 within the pca-space, indicating there's a fly in the ointment...


Solution

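  • It's because of scaling. A PCA that keeps all components is only a rotation of the data, so it preserves Euclidean distances, and centering is a translation that cancels out when one centroid is subtracted from another. With scale. = TRUE, however, prcomp divides each column by its standard deviation before rotating, so the PCA-space distance is the distance between the scaled features, not the raw ones. If you turn scaling (and, optionally, centering) off, the distance will be exactly the same in the original and the PCA feature space.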

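    wine.pca <- prcomp(wine[treatments], scale. = FALSE, center = FALSE)
    # recompute the PCA-space centroids from the new scores
    df.wine.x <- as.data.frame(wine.pca$x)
    df.wine.x$groups <- wine.class
    pca.centroids <- aggregate(df.wine.x[treatments], list(Type = df.wine.x$groups), mean)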
    
    dist(rbind(feature.centroids[feature.centroids$Type == "barolo",][-1],feature.centroids[feature.centroids$Type == "grignolino",][-1]), method = "euclidean")
    #         1
    # 2 1.468087
    dist(rbind(pca.centroids[pca.centroids$Type == "barolo",][-1],pca.centroids[pca.centroids$Type == "grignolino",][-1]), method = "euclidean")
    #         1
    # 2 1.468087
    

    Another way to get the same distance in both spaces is to scale / center the original data first and then apply PCA with scaling / centering, as follows:

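    wine[treatments] <- scale(wine[treatments], center = TRUE)
    wine.pca <- prcomp(wine[treatments], scale. = TRUE)
    # recompute both sets of centroids from the scaled data and the new scores
    feature.centroids <- aggregate(wine[treatments], list(Type = wine$groups), mean)
    df.wine.x <- as.data.frame(wine.pca$x)
    df.wine.x$groups <- wine.class
    pca.centroids <- aggregate(df.wine.x[treatments], list(Type = df.wine.x$groups), mean)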
    
    dist(rbind(feature.centroids[feature.centroids$Type == "barolo",][-1],feature.centroids[feature.centroids$Type == "grignolino",][-1]), method = "euclidean")
    #        1
    # 2 1.80717
    dist(rbind(pca.centroids[pca.centroids$Type == "barolo",][-1],pca.centroids[pca.centroids$Type == "grignolino",][-1]), method = "euclidean")
    #        1
    # 2 1.80717
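
If you want to check this without installing ggbiplot, here is a minimal sketch of the same comparison. It uses the built-in iris data as a stand-in for wine (that swap, and the helper names centroids and centroid_dist, are my own assumptions for illustration, not part of the question). It compares the distance between the first two group centroids in the raw features, in the scaled features, and in two PCA spaces:

```r
# Built-in iris data stands in for the wine data (assumption, for portability).
features <- iris[, 1:2]
groups   <- iris$Species

# Hypothetical helper: group centroids as a matrix, one row per group.
centroids <- function(x, g) {
  as.matrix(aggregate(as.data.frame(x), list(Type = g), mean)[-1])
}

# Hypothetical helper: Euclidean distance between the centroids in rows a and b.
centroid_dist <- function(cent, a = 1, b = 2) {
  sqrt(sum((cent[a, ] - cent[b, ])^2))
}

pca_scaled   <- prcomp(features, center = TRUE, scale. = TRUE)
pca_unscaled <- prcomp(features, center = TRUE, scale. = FALSE)

d_raw          <- centroid_dist(centroids(features, groups))
d_scaled       <- centroid_dist(centroids(scale(features), groups))
d_pca_scaled   <- centroid_dist(centroids(pca_scaled$x, groups))
d_pca_unscaled <- centroid_dist(centroids(pca_unscaled$x, groups))

# Scaled PCA agrees with the scaled features, not with the raw ones.
scaled_matches_scaled <- isTRUE(all.equal(d_pca_scaled, d_scaled))
scaled_matches_raw    <- isTRUE(all.equal(d_pca_scaled, d_raw))

# Unscaled PCA agrees with the raw features even though it is centered.
unscaled_matches_raw  <- isTRUE(all.equal(d_pca_unscaled, d_raw))

print(c(scaled_matches_scaled, scaled_matches_raw, unscaled_matches_raw))
```

The third comparison is the one worth noticing: leaving center = TRUE does not break the agreement with the raw features, so only the scaling step accounts for the mismatch in the question. Note that all of this assumes every principal component is kept; if you drop components, the PCA-space distance can only shrink.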