I want to run a PCA using prcomp
on a dataset that has repeated factor values in the first two columns, followed by numeric columns:
Genus1 Species1 6.320000 8.720000 6.420000
Genus2 Species2 8.430000 11.780000 4.490000
Genus2 Species2 8.310000 10.940000 4.180000
Genus3 Species3 9.290000 13.060000 5.990000
Genus3 Species3 8.960000 13.320000 6.36000
How can I get this dataset into the correct format to run with prcomp
such that the PC scores will be in the same order as the original dataset?
Let's say your data is:
x = structure(list(V1 = structure(c(1L, 2L, 2L, 3L, 3L), .Label = c("Genus1",
"Genus2", "Genus3"), class = "factor"), V2 = structure(c(1L,
2L, 2L, 3L, 3L), .Label = c("Species1", "Species2", "Species3"
), class = "factor"), V3 = c(6.32, 8.43, 8.31, 9.29, 8.96), V4 = c(8.72,
11.78, 10.94, 13.06, 13.32), V5 = c(6.42, 4.49, 4.18, 5.99, 6.36
)), class = "data.frame", row.names = c(NA, -5L))
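prcomp only accepts numeric data, so the factor columns have to be left out anyway. Run it on the numeric columns and bind the factors back on; the scores come back in the same row order as the input:
# PCA on the numeric columns only (V3 to V5)
pca = prcomp(x[,3:5])
# put the two factor columns back in front of the scores
pca_scores = cbind(x[,1:2],pca$x)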
pca_scores
V1 V2 PC1 PC2 PC3
1 Genus1 Species1 -3.4571239 0.8812539 0.003197962
2 Genus2 Species2 0.2914003 -0.9790128 -0.165842662
3 Genus2 Species2 -0.4813849 -1.3641274 0.099844800
4 Genus3 Species3 1.8024971 0.5080058 0.199344981
5 Genus3 Species3 1.8446114 0.9538805 -0.136545080
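If you would rather not hard-code the column positions, the same idea can be sketched by selecting columns by type. This is only a minimal sketch under the assumption that every numeric column should go into the PCA; the names is_num and manual are made up for illustration. It also recomputes the scores by hand from the centred data and the rotation matrix, which is one way to convince yourself that row i of the scores belongs to row i of the input.

```r
# Same five rows as above, rebuilt with data.frame() for brevity.
x <- data.frame(
  V1 = factor(c("Genus1", "Genus2", "Genus2", "Genus3", "Genus3")),
  V2 = factor(c("Species1", "Species2", "Species2", "Species3", "Species3")),
  V3 = c(6.32, 8.43, 8.31, 9.29, 8.96),
  V4 = c(8.72, 11.78, 10.94, 13.06, 13.32),
  V5 = c(6.42, 4.49, 4.18, 5.99, 6.36)
)

# Pick the numeric columns by type instead of by position
# (is_num is an illustrative name, not part of any API).
is_num <- vapply(x, is.numeric, logical(1))

pca <- prcomp(x[, is_num])

# Recompute the scores by hand: centre each column with the means
# prcomp stored, then project onto the rotation matrix.
manual <- scale(x[, is_num], center = pca$center, scale = FALSE) %*% pca$rotation

# Row i of the manual projection matches row i of pca$x,
# so the scores are in the same order as the input rows.
stopifnot(isTRUE(all.equal(manual, pca$x, check.attributes = FALSE)))

# Attach the non-numeric columns back to the scores, row for row.
pca_scores <- cbind(x[, !is_num], pca$x)
```

Note that the sign of each PC is arbitrary, so the scores may come out with flipped signs on another machine without changing the interpretation.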