I have multidimensional data (100+ variables), a subset of which I expect to lie, more or less, on a plane. What would be the best way to fit a plane to that subset in R?
I'd like to use the plane to calculate the distance from some other points to it and to plot some of its dimensions.
Principal components can solve this for you. Assuming that your data really does match a plane, the first two principal components should describe that plane well.
You do not provide any sample data, so I will illustrate with some artificial data. My data is ten-dimensional, but all points lie close to a plane (with some error in the other eight directions).
## Sample data
set.seed(2018)
NPts = 1000
x = runif(NPts)
y = runif(NPts)
cx = rnorm(1)
cy = rnorm(1)
V1 = cx*x + cy*y + rnorm(NPts, 0, 0.1)
MyData = data.frame(V1)
for(i in 2:10) {
  cx = rnorm(1)
  cy = rnorm(1)
  name = paste0("V", i)
  MyData[,name] = cx*x + cy*y + rnorm(NPts, 0, 0.1)
}
Since all variables are linear combinations of x and y (plus a small error), the data is essentially two-dimensional: the points lie near a plane in the ten-dimensional space, and x and y act as coordinates on that plane. Here I am treating x and y as latent variables. They do not appear in the data but drive the behavior of all the other variables.
## Principal Components Analysis
PCA = prcomp(MyData)
plot(PCA)
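The scree plot can also be checked numerically. Here is a minimal sketch that computes the share of total variance carried by each component from the sdev element of the prcomp result. So that it runs on its own, it builds a small data set of the same kind as above; the names Demo, DemoPCA and VarExplained are made up for this illustration.

```r
## Sketch only: self-contained demo data, ten variables driven by two latent ones
set.seed(1)
n = 500
x = runif(n)
y = runif(n)
Demo = sapply(1:10, function(i) rnorm(1)*x + rnorm(1)*y + rnorm(n, 0, 0.1))

DemoPCA = prcomp(Demo)

## sdev holds the standard deviation along each component, so sdev^2 is the variance
VarExplained = DemoPCA$sdev^2 / sum(DemoPCA$sdev^2)

## Cumulative share of variance; the second entry covers the first two components
round(cumsum(VarExplained), 3)
```

Applied to the PCA object above, the same expression is PCA$sdev^2 / sum(PCA$sdev^2). If the first two entries add up to nearly 1, a plane is a reasonable model for the data.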
Yep, the data looks basically two-dimensional: the variances plotted for the first two components dwarf the rest. All that remains is to get the first two principal components. They are stored in the structure returned from prcomp.
PCA$rotation[,1:2]
            PC1          PC2
V1   0.42752681 -0.204894748
V2  -0.64546573 -0.056503044
V3   0.04606707 -0.009614603
V4   0.01956126 -0.539070667
V5   0.15987617  0.600122935
V6  -0.06255399  0.054053476
V7   0.26497132  0.388920891
V8   0.21645814 -0.366709584
V9   0.49363625 -0.116954131
V10  0.08874645  0.040656622
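The plane that we are looking for is the one spanned by these two vectors and passing through the centroid of the data. prcomp centers the data by default and stores that centroid in PCA$center.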
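Since the question also asks for the distance from other points to the plane, here is a minimal sketch of one way to get it. It assumes prcomp was called with its defaults (centering on, no scaling), so the plane passes through the center element of the result, and that the new points have the same columns in the same order as the fitted data. DistToPlane is a hypothetical helper written for this illustration, not an existing function, and the demo data is made up so that the block runs on its own.

```r
## Hypothetical helper: distance from each row of NewPts to the plane through
## Fit$center spanned by the first two principal components of Fit
DistToPlane = function(Fit, NewPts) {
  W = Fit$rotation[, 1:2, drop = FALSE]               # orthonormal basis of the plane
  Centered = sweep(as.matrix(NewPts), 2, Fit$center)  # subtract the centroid
  Scores = Centered %*% W                             # coordinates within the plane
  Residual = Centered - Scores %*% t(W)               # part orthogonal to the plane
  sqrt(rowSums(Residual^2))
}

## Self-contained demo data of the same kind as above (assumed, not the original)
set.seed(1)
n = 500
x = runif(n)
y = runif(n)
Demo = sapply(1:10, function(i) rnorm(1)*x + rnorm(1)*y + rnorm(n, 0, 0.1))
Fit = prcomp(Demo)

## One point built to lie in the plane, one moved 5 units along the third component
OnPlane = Fit$center + 2 * Fit$rotation[, 1] - 3 * Fit$rotation[, 2]
OffPlane = OnPlane + 5 * Fit$rotation[, 3]
Dists = DistToPlane(Fit, rbind(OnPlane, OffPlane))
```

Because the columns of the rotation matrix are orthonormal, the first distance is zero up to rounding and the second is 5. For plotting, the in-plane coordinates of the fitted points are the first two columns of the x element, so something like plot(PCA$x[, 1:2]) shows the data as seen within the plane.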