I have followed this tutorial to create and visualise a PCA. The part Im particularly interested in is adding new data points to the existing model.
As the tutorial suggests, one would use predict (ir.pca, newdata=tail(log.ir, 2)) to predict new PCs. But how do I add these new observations to the existing plot ? It doesnt look like predict function returns the same object as the ir.pca used in ggplot function.
I have found similar questions here and here but these are calculating new PCA scores and adding them to the variance plot (if I understood it correctly).
Ultimately what Im after is to see whether new points fall in within the confidence ellipse defined/derived using the initial dataset.
The code I'm using from the tutorial:
# log transform
log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]
# apply PCA - scale. = TRUE is highly
# advisable, but default is FALSE.
ir.pca <- prcomp(log.ir,
center = TRUE,
scale. = TRUE)
library(devtools)
install_github("ggbiplot", "vqv")
library(ggbiplot)
g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1,
groups = ir.species, ellipse = TRUE,
circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position = 'top')
print(g)
And as the tutorial suggests I'd like to add new data which came in to the existing plot visualised with ggplot
Thanks
When we inspect the ggplot
object, we see it has an element named data
:
str(g)
# List of 9
# $ data :'data.frame': 150 obs. of 3 variables:
# ..$ xvar : num [1:150] -2.41 -2.22 -2.58 -2.45 -2.54 ...
# ..$ yvar : num [1:150] -0.397 0.69 0.428 0.686 -0.508 ...
# ..$ groups: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ layers :List of 5
# <snip>
Hence we can just add the new data points to the data
dataframe. Suppose these 10 observations from iris
are our "new" observations, and we predict their PC values:
set.seed(123)
x <- sample(seq_len(nrow(iris)), 10)
predicted <- predict(ir.pca, newdata = log.ir[x, ])
We can add these predicted values to the data
dataframe
g$data <- rbind(g$data,
data.frame(
xvar = predicted[, "PC1"],
yvar = predicted[, "PC2"],
groups = "new"
)
)