I'm looking for the weights that represent the mapping from a dataset to its principal components (PCs). The aim is to set up a "calibrated" fixed space, e.g. from three sorts of wine, so that when new observations (e.g. a new sort of wine) are introduced, they can be allocated within the previously calibrated space without changing the fixed PC values. In other words, new observations should be placed by applying the same transformation that was fitted on the first three sorts.
library(ggbiplot)
data(wine)
wine.pca <- prcomp(wine, center = TRUE, scale. = TRUE)
print(ggbiplot(wine.pca, obs.scale = 1, var.scale = 1, groups = wine.class, ellipse = TRUE, circle = TRUE))
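As I understand it, the weights I'm after should be the loadings stored in the prcomp fit; a minimal sketch of extracting them (assuming the standard prcomp return values):
# the rotation matrix holds the loadings (weights) that map the
# centred/scaled variables onto the PCs
wine.pca$rotation[, 1:3]  # weights for PC1-PC3
# the centring and scaling applied before rotation, needed for new data
wine.pca$center
wine.pca$scale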
Edit: The wine dataset is split into a training set, which is used to build what I called the calibrated space.
samp <- sample(nrow(wine), nrow(wine)*0.75)
wine.train <- wine[samp,]
The remaining rows form a validation set, to be projected using the transformation fitted on the training data, e.g.
wine.valid <- wine[-samp,]
#PCA on training data
wine.train.pca <- prcomp(wine.train, center = TRUE, scale. = TRUE)
#use the transformation matrix from the training data to predict the validation data
pred <- predict(wine.train.pca, newdata = wine.valid)
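As I understand it, predict here applies the stored centring/scaling and then multiplies by the rotation matrix; a minimal sketch of the equivalent manual transformation (assuming the prcomp object above):
# manual equivalent of predict.prcomp: scale the new data with the
# training parameters, then rotate into the calibrated PC space
pred.manual <- scale(wine.valid,
                     center = wine.train.pca$center,
                     scale  = wine.train.pca$scale) %*% wine.train.pca$rotation
all.equal(pred.manual, pred)  # should be TRUE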
How to represent the calibrated space built from the training data, together with the projected validation/testing data, is addressed below in this thread.
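In the meantime, a minimal sketch of one such representation (assuming the objects defined above) is to plot the training scores and overlay the projected validation points:
# calibrated space: training scores on PC1/PC2
plot(wine.train.pca$x[, 1:2], col = "grey60", pch = 16,
     xlab = "PC1", ylab = "PC2")
# validation observations projected into the same fixed space
points(pred[, 1:2], col = "red", pch = 17)
legend("topright", c("training", "projected validation"),
       col = c("grey60", "red"), pch = c(16, 17))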
Projecting new observations onto a previously fitted PCA is easy to do with the predict function for prcomp. Below I demonstrate this by splitting your wine data into two parts, a training and a validation dataset. The PCA coordinates of the validation data, predicted with the prcomp fit on the training set, are then compared with the same coordinates derived from a PCA of the full dataset:
library(ggbiplot)
data(wine)
# pca on whole dataset
wine.pca <- prcomp(wine, center = TRUE, scale. = TRUE)
# pca on training part of dataset, then project new data onto pca coordinates
set.seed(1)
samp <- sample(nrow(wine), nrow(wine)*0.75)
wine.train <- wine[samp,]
wine.valid <- wine[-samp,]
wine.train.pca <- prcomp(wine.train, center = TRUE, scale. = TRUE)
pred <- predict(wine.train.pca, newdata = wine.valid)
# plot original vs predicted pca coordinates
matplot(wine.pca$x[-samp, 1:4], pred[, 1:4])
You can also look at the correlation between the predicted and original coordinates; it is very high for the leading PCs. Note that abs() is used because the sign of each PC is arbitrary and may flip between fits:
# correlation of predicted coordinates
abs(diag(cor(wine.pca$x[-samp, ], pred)))
# PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
# 0.9991291 0.9955028 0.9882540 0.9418268 0.9681989 0.9770390 0.9603593 0.8991734 0.8090762 0.9326917
# PC11 PC12 PC13
# 0.9270951 0.9596963 0.9397388
Here is an example of classification using randomForest:
library(ggbiplot)
data(wine)
wine$class <- wine.class
# install.packages("randomForest")
library(randomForest)
set.seed(1)
train <- sample(nrow(wine), nrow(wine)*0.5)
valid <- seq(nrow(wine))[-train]
winetrain <- wine[train,]
winevalid <- wine[valid,]
modfit <- randomForest(class ~ ., data = winetrain, ntree = 500)
pred <- predict(modfit, newdata=winevalid, type='class')
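To tie this back to the calibrated-space idea, the classifier can equally be trained on the training PC scores and applied to validation observations projected into that space. A minimal sketch, assuming the objects defined above (the choice of 5 PCs is arbitrary, for illustration only):
# PCA on the training predictors only
pca.tr <- prcomp(winetrain[, names(winetrain) != "class"], center = TRUE, scale. = TRUE)
# keep the first 5 PCs (arbitrary choice for illustration)
train.scores <- data.frame(pca.tr$x[, 1:5], class = winetrain$class)
valid.scores <- data.frame(predict(pca.tr, newdata = winevalid[, names(winevalid) != "class"])[, 1:5])
modfit.pca <- randomForest(class ~ ., data = train.scores, ntree = 500)
pred.pca <- predict(modfit.pca, newdata = valid.scores, type = "class")
mean(pred.pca == winevalid$class)  # accuracy in the calibrated PC space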
The importance of each variable can be returned in the following way:
importance(modfit) # importance of variables in prediction
# MeanDecreaseGini
# Alcohol 8.5032770
# MalicAcid 1.3122286
# Ash 0.6827924
# AlcAsh 1.9517369
# Mg 1.3632713
# Phenols 2.7943536
# Flav 6.5798205
# NonFlavPhenols 1.1712744
# Proa 1.2412928
# Color 8.7097870
# Hue 5.2674082
# OD 6.6101764
# Proline 10.7032775
And, the prediction accuracy is returned as follows:
TAB <- table(pred, winevalid$class) # table of predictions vs. original classifications
TAB
# pred barolo grignolino barbera
# barolo 29 1 0
# grignolino 1 30 0
# barbera 0 1 27
sum(diag(TAB)) / sum(TAB) # overall accuracy
# [1] 0.9662921