Calculate the multidimensional distance from the center of the galactic space


I have a data frame called mydf that contains 10 principal components (10 dimensions) in galactic space for 5 samples. I want to find the centroid (gravitational center) of the samples using all PCs, and then the distance of each sample from that centroid. How can we do this in R?

   mydf<-  structure(list(Sample = c("1", "2", "4", "5", "6"), PCA.1 = c(0.00338, 
    -0.020373, -0.019842, -0.019161, -0.019594), PCA.2 = c(0.00047, 
    -0.010116, -0.011532, -0.011582, -0.013245), PCA.3 = c(-0.008787, 
    0.001412, 0.003751, 0.00371, 0.004242), PCA.4 = c(0.011242, 0.000882, 
    -0.003662, -0.002206, -0.002449), PCA.5 = c(0.055873, -0.022664, 
    -0.014058, -0.024757, -0.020033), PCA.6 = c(-0.001511, 0.006226, 
    -0.005417, 0.000522, -0.003114), PCA.7 = c(-0.056734, -0.007418, 
    -0.01043, -0.006961, -0.006006), PCA.8 = c(0.005189, 0.008031, 
    -0.002979, 0.000743, 0.006276), PCA.9 = c(0.008169, -0.000265, 
    0.010893, 0.003233, 0.007316), PCA.10 = c(-0.000461, -0.003893, 
    0.008549, 0.005556, -0.001499)), .Names = c("Sample", "PCA.1", 
    "PCA.2", "PCA.3", "PCA.4", "PCA.5", "PCA.6", "PCA.7", "PCA.8", 
    "PCA.9", "PCA.10"), row.names = c(NA, 5L), class = "data.frame")

For example, this is the PCA plot (obviously in 2D) of these 5 samples. I first need to find the centroid using all 10 dimensions, and then calculate the distance of each sample from that one centroid.

[Image: 2D PCA plot of the 5 samples]


Solution

  • I don't think it would be difficult to show that, for equally weighted masses at the ten-dimensional points given by those 5 vectors, the sum of squared distances from a point is minimized when that point is the vector of column means:

    > centroid = colMeans(mydf[-1])
    
    > centroid
         PCA.1      PCA.2      PCA.3      PCA.4      PCA.5      PCA.6      PCA.7      PCA.8      PCA.9     PCA.10 
    -0.0151180 -0.0092010  0.0008656  0.0007614 -0.0051278 -0.0006588 -0.0175098  0.0034520  0.0058692  0.0016504 
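
As a quick numerical sanity check of that minimization claim, here is a minimal sketch on a small made-up matrix (not the question's data). The names `pts`, `sum_sq_dist`, and `ctr` are hypothetical and only serve the illustration.

```r
# Small made-up matrix: 3 points in 2 dimensions (not the question's data)
pts <- matrix(c(0, 0,
                2, 0,
                1, 3), ncol = 2, byrow = TRUE)

# Sum of squared Euclidean distances from every row of pts to the point p
# (sum_sq_dist is a hypothetical helper name)
sum_sq_dist <- function(p) sum(sweep(pts, 2, p, "-")^2)

ctr <- colMeans(pts)                 # column means: c(1, 1)
at_centroid <- sum_sq_dist(ctr)      # 8

# Moving away from the column means should only increase the sum
nudged <- sum_sq_dist(ctr + c(0.1, -0.1))   # 8.06
at_centroid < nudged                 # TRUE
```

Trying a single nudge is an illustration rather than a proof; it just shows the sum growing as the point leaves the column means.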
    

    And then the squared distances would be (wrap the expression in sqrt() to get the Euclidean distances):

     > rowSums( sweep(mydf[-1], 2, centroid, "-")^2 )
               1            2            3            4            5 
    0.0059118459 0.0005748535 0.0003223413 0.0005664300 0.0004386126 
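
Note that `rowSums(...^2)` yields squared distances; the Euclidean distance is its square root. A minimal sketch of the full calculation follows, using a made-up data frame in the same layout as `mydf` (an ID column followed by numeric columns) so the result is easy to check by hand. The names `toy`, `pcs`, `ctr`, and `eucl` are hypothetical.

```r
# Toy data in the same layout as mydf: an ID column, then numeric columns.
# Values are made up so the answer can be verified by hand.
toy <- data.frame(Sample = c("a", "b", "c", "d"),
                  PCA.1  = c(0, 6, 6, 0),
                  PCA.2  = c(0, 8, 0, 8))

pcs <- as.matrix(toy[-1])     # drop the Sample column
ctr <- colMeans(pcs)          # PCA.1 = 3, PCA.2 = 4

# Subtract the centroid from every row, square, sum per row, take the root
eucl <- sqrt(rowSums(sweep(pcs, 2, ctr, "-")^2))
names(eucl) <- toy$Sample
eucl                          # each sample is 5 units from the centroid
```

Applied to the question's data, the same pattern would be `sqrt(rowSums(sweep(mydf[-1], 2, colMeans(mydf[-1]), "-")^2))`.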
    

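    For plotting the values in the first two "dimensions" I would use this:

    # scatter plot of the samples on the first two PCs
    with(mydf, plot(PCA.2 ~ PCA.1))
    # mark the centroid's first two coordinates with a red triangle
    points(x = centroid["PCA.1"], y = centroid["PCA.2"], col = 'red', pch = 24)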