Search code examples
rclassificationrandom-forestpca

Calculate the transformation of a PCA in R?


I'm looking for the weights that represent the mapping from a dataset to its PCs. The aim is to set up a "calibrated" fixed space e.g. of three sorts of wine and when new observations e.g. a new sort of wine is introduced it can be allocated within the previously calibrated space without changing the fixed PC values. So new observations can be allocated appropriately by executing the transformation applied to the first three sorts.

 library(ggbiplot)
 data(wine)
 wine.pca <- prcomp(wine, center = TRUE, scale. = TRUE)
 print(ggbiplot(wine.pca, obs.scale = 1, var.scale = 1, groups =   wine.class, ellipse = TRUE, circle = TRUE))

Edit: The wine dataset is split into training data, to get what I called a calibrated space.

samp <- sample(nrow(wine), nrow(wine)*0.75)
wine.train <- wine[samp,]

Then subset a dataset to be validated using the training data e.g.

wine.valid <- wine[-samp,]

#PCA on training data
wine.train.pca <- prcomp(wine.train, center = TRUE, scale. = TRUE)
#use the transformation matrix from the training data to predict the validation data
pred <- predict(wine.train.pca, newdata = wine.valid)

Subsequently, how to represent the calibrated space resulting from the training and the validation/testing data transformed is addressed in this thread.


Solution

  • This is easy to do with the predict function for prcomp. Below I show the performance by splitting up your wine data into two parts; training and validation datasets. The prediction of the validation PCA coordinates using the the prcomp-fitted PCA on the training set is then compared with those same coordinates as derived from the full dataset :

    library(ggbiplot)
    data(wine)
    
    # pca on whole dataset
    wine.pca <- prcomp(wine, center = TRUE, scale. = TRUE)
    
    # pca on training part of dataset, then project new data onto pca coordinates 
    set.seed(1)
    samp <- sample(nrow(wine), nrow(wine)*0.75)
    wine.train <- wine[samp,]
    wine.valid <- wine[-samp,]
    wine.train.pca <- prcomp(wine.train, center = TRUE, scale. = TRUE)
    pred <- predict(wine.train.pca, newdata = wine.valid)
    
    # plot original vs predicted pca coordinates
    matplot(wine.pca$x[-samp,,1:4], pred[,1:4])
    

    enter image description here

    You can also look at the correlation between the predicted and original coordinates, and see that they are very high for the leading PCs:

    # correlation of predicted coordinates
    abs(diag(cor(wine.pca$x[-samp,], pred[,])))
    #       PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8       PC9      PC10 
    # 0.9991291 0.9955028 0.9882540 0.9418268 0.9681989 0.9770390 0.9603593 0.8991734 0.8090762 0.9326917 
    #      PC11      PC12      PC13 
    # 0.9270951 0.9596963 0.9397388 
    

    Edit:

    Here is an example of classification using randomForest:

    library(ggbiplot)
    data(wine)
    wine$class <- wine.class
    
    # install.packages("randomForest")
    library(randomForest)
    
    set.seed(1)
    train <- sample(nrow(wine), nrow(wine)*0.5)
    valid <- seq(nrow(wine))[-train]
    winetrain <- wine[train,]
    winevalid <- wine[valid,]
    
    modfit <- randomForest(class~., data=winetrain, nTree=500)
    pred <- predict(modfit, newdata=winevalid, type='class')
    

    The importance of each variable can be returned in the following way:

    importance(modfit) # importance of variables in predition
    #                MeanDecreaseGini
    # Alcohol               8.5032770
    # MalicAcid             1.3122286
    # Ash                   0.6827924
    # AlcAsh                1.9517369
    # Mg                    1.3632713
    # Phenols               2.7943536
    # Flav                  6.5798205
    # NonFlavPhenols        1.1712744
    # Proa                  1.2412928
    # Color                 8.7097870
    # Hue                   5.2674082
    # OD                    6.6101764
    # Proline              10.7032775
    

    And, the prediction accuracy is returned as follows:

    TAB <- table(pred, winevalid$class) # table of preditions vs. original classifications
    TAB
    # pred         barolo grignolino barbera
    #   barolo         29          1       0
    #   grignolino      1         30       0
    #   barbera         0          1      27
    
    sum(diag(TAB)) / sum(TAB) # overall accuracy
    # [1] 0.9662921