Tags: r, machine-learning, data-mining, pca

Dynamically selecting principal components from the PCA output


This seems like a trivial problem, but I am unable to resolve it.

I took the numeric columns of the iris data set and normalized them as shown below:

newiris<-iris[,1:4]
iris.norm<-data.frame(scale(newiris))
head(iris.norm)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1   -0.8976739  1.01560199    -1.335752   -1.311052
2   -1.1392005 -0.13153881    -1.335752   -1.311052
3   -1.3807271  0.32731751    -1.392399   -1.311052
4   -1.5014904  0.09788935    -1.279104   -1.311052
5   -1.0184372  1.24503015    -1.335752   -1.311052
6   -0.5353840  1.93331463    -1.165809   -1.048667

# performed PCA now
pccomp <- prcomp(iris.norm )
summary(pccomp)
a <- summary(pccomp)
df<- as.data.frame(a$importance)
df <- t(df)
df
##     Standard deviation Proportion of Variance Cumulative Proportion
## PC1          1.7083611                0.72962               0.72962
## PC2          0.9560494                0.22851               0.95813
## PC3          0.3830886                0.03669               0.99482
## PC4          0.1439265                0.00518               1.00000

Next, I convert the row names of df into a column, so that the PC labels (previously the row names) form the first column for further manipulation:

   library(tibble)
   library(dplyr)
   df<-rownames_to_column(as.data.frame(df), var="PrinComp") %>% head
   df
   ##   PrinComp Standard deviation Proportion of Variance Cumulative Proportion
   ## 1      PC1          1.7083611                0.72962               0.72962
   ## 2      PC2          0.9560494                0.22851               0.95813
   ## 3      PC3          0.3830886                0.03669               0.99482
   ## 4      PC4          0.1439265                0.00518               1.00000

# Select only those PCs whose cumulative proportion is below a cutoff, say 96%
pcs <- df$PrinComp[df$`Cumulative Proportion` < 0.96]  # cumulative proportion below 96%
pcs
## [1] "PC1" "PC2"

Now I create a PC data frame statically, from the score vectors of the first 2 principal components that satisfied the condition above (cumulative proportion < 0.96):

 x1 <- pccomp$x[,1]
 x2 <- pccomp$x[,2]
 pcdf <- cbind(x1,x2)
 head(pcdf)
##             x1         x2
## [1,] -2.257141 -0.4784238
## [2,] -2.074013  0.6718827
## [3,] -2.356335  0.3407664
## [4,] -2.291707  0.5953999
## [5,] -2.381863 -0.6446757
## [6,] -2.068701 -1.4842053

My question: how can I create the above PC data frame dynamically, once I know the number of PCs that satisfy a condition such as the cumulative proportion being less than 0.95?


Solution

  • You can run a while loop over df's cumulative proportion column and append each component's scores for as long as the cumulative proportion stays below the required threshold.

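    threshold <- 0.96
    pcdf <- list()
    i <- 1
    # the bound on i keeps the condition from becoming NA past the last PC
    while (i <= nrow(df) && df$`Cumulative Proportion`[i] < threshold) {
        pcdf[[i]] <- pccomp$x[, i]   # scores of the i-th principal component
        i <- i + 1
    }
    pcdf <- as.data.frame(pcdf)

    # rename the columns x1, x2, ...
    names(pcdf) <- paste0("x", seq_len(ncol(pcdf)))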
    

    The output

    > head(pcdf)
             x1         x2
    1 -2.257141 -0.4784238
    2 -2.074013  0.6718827
    3 -2.356335  0.3407664
    4 -2.291707  0.5953999
    5 -2.381863 -0.6446757
    6 -2.068701 -1.4842053
    

    With threshold = 0.999, running the same code gives

    > head(pcdf)
             x1         x2          x3
    1 -2.257141 -0.4784238  0.12727962
    2 -2.074013  0.6718827  0.23382552
    3 -2.356335  0.3407664 -0.04405390
    4 -2.291707  0.5953999 -0.09098530
    5 -2.381863 -0.6446757 -0.01568565
    6 -2.068701 -1.4842053 -0.02687825
    

    UPDATE

    Assuming you already know the number of principal components you want, say i, you can use

    a <- sapply(X = seq_len(i), FUN = function(k) pccomp$x[, k])

    instead of the whole while loop section. For i = 2 you get

    > head(a)
              [,1]       [,2]
    [1,] -2.257141 -0.4784238
    [2,] -2.074013  0.6718827
    [3,] -2.356335  0.3407664
    [4,] -2.291707  0.5953999
    [5,] -2.381863 -0.6446757
    [6,] -2.068701 -1.4842053
    

    where a is your result.
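
As a possible alternative to the loop, the selection can be sketched in vectorized form. This is only a sketch under stated assumptions: select_pcs is a hypothetical helper name (not part of any package), and the cumulative proportion is recomputed from pccomp$sdev rather than read from the summary table, which should agree with summary(pccomp) up to rounding. Because the cumulative proportion never decreases, counting the entries below the threshold gives the number of leading components to keep.

```r
# Hypothetical helper: keep the leading PCs whose cumulative
# proportion of variance is below `threshold`.
select_pcs <- function(pca, threshold) {
  # cumulative proportion of variance, from the component standard deviations
  cum_prop <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  # cum_prop is non-decreasing, so this counts the leading components
  k <- sum(cum_prop < threshold)
  # drop = FALSE keeps a matrix even when only one column is selected
  scores <- as.data.frame(pca$x[, seq_len(k), drop = FALSE])
  # rename PC1, PC2, ... to x1, x2, ... to match the naming used above
  names(scores) <- sprintf("x%d", seq_len(k))
  scores
}

pccomp <- prcomp(scale(iris[, 1:4]))
pcdf <- select_pcs(pccomp, threshold = 0.96)
head(pcdf)
```

With threshold = 0.96 this keeps two columns (x1, x2), and with threshold = 0.999 it keeps three, matching the loop above. If the goal were instead the smallest number of components that reaches the threshold, which(cum_prop >= threshold)[1] would give that count; that is a different criterion from the "less than" rule used in the question.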