I have a data frame called mydf which contains 10 principal components (10 dimensions) in galactic space for 5 samples. I want to find the centroid (gravitational center) of the samples using all PCs, and then the distance of each sample from that centroid. How can we do this in R?
mydf<- structure(list(Sample = c("1", "2", "4", "5", "6"), PCA.1 = c(0.00338,
-0.020373, -0.019842, -0.019161, -0.019594), PCA.2 = c(0.00047,
-0.010116, -0.011532, -0.011582, -0.013245), PCA.3 = c(-0.008787,
0.001412, 0.003751, 0.00371, 0.004242), PCA.4 = c(0.011242, 0.000882,
-0.003662, -0.002206, -0.002449), PCA.5 = c(0.055873, -0.022664,
-0.014058, -0.024757, -0.020033), PCA.6 = c(-0.001511, 0.006226,
-0.005417, 0.000522, -0.003114), PCA.7 = c(-0.056734, -0.007418,
-0.01043, -0.006961, -0.006006), PCA.8 = c(0.005189, 0.008031,
-0.002979, 0.000743, 0.006276), PCA.9 = c(0.008169, -0.000265,
0.010893, 0.003233, 0.007316), PCA.10 = c(-0.000461, -0.003893,
0.008549, 0.005556, -0.001499)), .Names = c("Sample", "PCA.1",
"PCA.2", "PCA.3", "PCA.4", "PCA.5", "PCA.6", "PCA.7", "PCA.8",
"PCA.9", "PCA.10"), row.names = c(NA, 5L), class = "data.frame")
For example, this is the PCA plot (obviously in 2D) for these 5 samples. I first need to find the centroid using all 10 dimensions, and then calculate the distance of each sample from that one centroid.
It is not difficult to show that, for equally weighted masses at the ten-dimensional points given by those 5 vectors, the sum of squared distances to a point is minimized when that point is the vector of column means:
> centroid = colMeans(mydf[-1])
> centroid
PCA.1 PCA.2 PCA.3 PCA.4 PCA.5 PCA.6 PCA.7 PCA.8 PCA.9 PCA.10
-0.0151180 -0.0092010 0.0008656 0.0007614 -0.0051278 -0.0006588 -0.0175098 0.0034520 0.0058692 0.0016504
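The minimization claim can be checked with a short derivation. The notation here is my own, not the question's: x_i is the i-th sample's vector of PC scores, n is the number of samples (5 here), and c is the candidate center.

```latex
f(c) = \sum_{i=1}^{n} \lVert x_i - c \rVert^2
\qquad\Longrightarrow\qquad
\nabla_c f(c) = -2 \sum_{i=1}^{n} (x_i - c) = 0
\qquad\Longrightarrow\qquad
c = \frac{1}{n} \sum_{i=1}^{n} x_i
```

Since f is a convex quadratic in c (its Hessian is 2n times the identity), this stationary point is the unique minimizer, and it is exactly what colMeans computes column by column.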
And then the squared distances from the centroid (wrap the expression in sqrt() to get the Euclidean distances) would be:
> rowSums( sweep(mydf[-1], 2, centroid, "-")^2 )
1 2 3 4 5
0.0059118459 0.0005748535 0.0003223413 0.0005664300 0.0004386126
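Note that rowSums(...^2) returns sums of squared differences, so a square root is still needed for Euclidean distance. Here is a minimal sketch of the full calculation on a toy data frame. The name toy and its values are hypothetical stand-ins chosen so the answer is easy to verify by hand; the same calls should apply to mydf.

```r
# Toy stand-in for mydf (hypothetical values, not the question's data):
# an ID column followed by numeric coordinate columns.
toy <- data.frame(Sample = c("a", "b", "c", "d"),
                  PC1 = c(0, 2, 0, 2),
                  PC2 = c(0, 0, 2, 2))

coords   <- as.matrix(toy[-1])   # drop the ID column, keep the coordinates
centroid <- colMeans(coords)     # per-column mean is the centroid

# Subtract the centroid from every row, square, sum across columns,
# then take the square root to get the Euclidean distance per sample.
dists <- sqrt(rowSums(sweep(coords, 2, centroid, "-")^2))
names(dists) <- toy$Sample       # label results by sample ID
dists
```

The four toy points are the corners of a square with its centroid at (1, 1), so every distance comes out as sqrt(2). Labelling the result with the Sample column is a convenience I added. In the question's data the row names (1 to 5) differ from the sample IDs ("1", "2", "4", "5", "6").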
For plotting the values in the first two "dimensions" I would use this instead:
with(mydf, plot(PCA.2 ~ PCA.1 ))
points( x= -0.0151180, y= -0.0092010, col='red', pch=24)