
PCA-LDA analysis - R


In this example (https://gist.github.com/thigm85/8424654), LDA is compared with PCA on the iris dataset. How can I also run LDA on the PCA results (PCA-LDA)?

Code:

require(MASS)
require(ggplot2)
require(scales)
require(gridExtra)

pca <- prcomp(iris[,-5],
              center = TRUE,
              scale. = TRUE) 

prop.pca = pca$sdev^2/sum(pca$sdev^2)

lda <- lda(Species ~ ., 
           iris, 
           prior = c(1,1,1)/3)

prop.lda = lda$svd^2/sum(lda$svd^2)

plda <- predict(object = lda,
                newdata = iris)

dataset = data.frame(species = iris[,"Species"],
                     pca = pca$x, lda = plda$x)

p1 <- ggplot(dataset) + geom_point(aes(lda.LD1, lda.LD2, colour = species, shape = species), size = 2.5) + 
  labs(x = paste("LD1 (", percent(prop.lda[1]), ")", sep=""),
       y = paste("LD2 (", percent(prop.lda[2]), ")", sep=""))

p2 <- ggplot(dataset) + geom_point(aes(pca.PC1, pca.PC2, colour = species, shape = species), size = 2.5) +
  labs(x = paste("PC1 (", percent(prop.pca[1]), ")", sep=""),
       y = paste("PC2 (", percent(prop.pca[2]), ")", sep=""))

grid.arrange(p1, p2)

Solution

  • Usually you do PCA-LDA to reduce the dimensions of your data before performing LDA. Ideally you decide how many of the first k principal components to keep. In your example with iris, we take the first 2 components; if you keep all 4, the result will look pretty much the same as LDA without PCA.

    Try it like this:

    pcdata = data.frame(pca$x[,1:2],Species=iris$Species)
    pc_lda <- lda(Species ~ .,data=pcdata , prior = c(1,1,1)/3)
    prop_pc_lda = pc_lda$svd^2/sum(pc_lda$svd^2)
    pc_plda <- predict(object = pc_lda,newdata = pcdata)
    
    dataset = data.frame(species = iris[,"Species"],pc_plda$x)
    
    p3 <- ggplot(dataset) + geom_point(aes(LD1, LD2, colour = species, shape = species), size = 2.5) + 
      labs(x = paste("LD1 (", percent(prop_pc_lda[1]), ")", sep=""),
           y = paste("LD2 (", percent(prop_pc_lda[2]), ")", sep=""))
    
    print(p3)
    


    You don't see much of a difference here because the first 2 components of the PCA capture most of the variance in the iris dataset.
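To check the two claims above on your own data, here is a minimal sketch. It assumes a 95% cumulative-variance cutoff for picking k, which is an arbitrary illustrative choice and not part of the original answer; the names `cum_var`, `threshold`, `pcdata_all`, `pc_lda_all`, `raw_lda` and `same_classes` are likewise made up for this example. It also compares LDA on all principal components with LDA on the raw measurements, which should classify identically because keeping every component only rotates and rescales the data.

```r
library(MASS)  # provides lda()

# Same PCA as in the question: centred and scaled iris measurements
pca <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)

# Cumulative share of total variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Assumed rule: keep the fewest components that reach the threshold
threshold <- 0.95
k <- which(cum_var >= threshold)[1]

# LDA on ALL principal components versus LDA on the raw variables
pcdata_all <- data.frame(pca$x, Species = iris$Species)
pc_lda_all <- lda(Species ~ ., data = pcdata_all, prior = c(1, 1, 1) / 3)
raw_lda    <- lda(Species ~ ., data = iris,       prior = c(1, 1, 1) / 3)

# Compare the predicted classes of the two models on the training data
same_classes <- all(
  predict(pc_lda_all, newdata = pcdata_all)$class ==
    predict(raw_lda, newdata = iris)$class
)

print(round(cum_var, 3))  # cumulative variance per component
print(k)                  # number of components selected by the rule
print(same_classes)       # TRUE if both models assign the same classes
```

If `k` comes out smaller than the number of original variables, PCA-LDA actually reduces the input to LDA; on a small, low-dimensional dataset like iris the benefit is limited, and the approach tends to matter more when there are many correlated predictors.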