Search code examples
rpcafactors

PCA with new Factors in R


My objective to fit a linear model with the same response, but predictors replaced by factors/scores. I am trying to find out which principal components to include in such a linear model if I want to achieve an R^2 of at least 0.9*r.squared from my original model. Which predictors should I choose?

  model1 <- lm(Resp~.,data=test_dat)
  > summary(model1)

  Call:
  lm(formula = Resp ~ ., data = test_dat)

Residuals:
 Min       1Q   Median       3Q      Max 
 -0.35934 -0.07729  0.00330  0.08204  0.38709 

 Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.18858    0.06926 -46.039   <2e-16 ***
Pred1        4.32083    0.03767 114.708   <2e-16 ***
Pred2        2.42110    0.04740  51.077   <2e-16 ***
Pred3       -1.00507    0.04435 -22.664   <2e-16 ***
Pred4       -3.19480    0.09147 -34.927   <2e-16 ***
Pred5        2.77779    0.05368  51.748   <2e-16 ***
Pred6        1.22923    0.05427  22.648   <2e-16 ***
 Pred7       -1.21338    0.04562 -26.595   <2e-16 ***
Pred8        0.02485    0.05937   0.419    0.676    
Pred9       -0.67831    0.05308 -12.778   <2e-16 ***
Pred10       1.69947    0.02628  64.672   <2e-16 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Residual standard error: 0.1193 on 489 degrees of freedom
  Multiple R-squared:  0.997,   Adjusted R-squared:  0.997 
  F-statistic: 1.645e+04 on 10 and 489 DF,  p-value: < 2.2e-16

My new model should have an R^2 >= 0.897

    (threshold<-0.9*r.sqrd)
    [1] 0.8973323

 metrics.swiss <- calc.relimp(model1, type = c("lmg", "first", "last","betasq", "Pratt"))
 metrics.swiss
  [email protected]
  >Pred1  Pred2  Pred3  Pred4  Pred5  Pred6  Pred7  Pred8  Pred9 Pred10 
   2      8      3      6      1     10      5      4      7      9 

 sum(metrics.swiss@lmg)
 orderComponents<-c(5,1,3,8,7,4,9,2,10,6)
 PCAFactors<-Project.Data.PCA$scores
 Rotated<-as.data.frame(cbind(Resp=test_dat$Resp,PCAFactors))
 swissRotatedReordered<-Rotated[,c(1,orderComponents+1)]
  (nestedRSquared<-sapply(2:11,function(z) 
  summary(lm(Resp~.,data=swissRotatedReordered[,1:z]))$r.squared))
 [1] 0.001041492 0.622569992 0.689046489 0.690319839 0.715051745 0.732286987
 [7] 0.742441421 0.991291253 0.995263470 0.997035905

Solution

  • You run a linear model on the new model with your scores. "lmg" will allow you to see which factors made the most contribution and those are the factors you should keep. In my case it was the top 3 factors

     predictors <- test_dat[-1]
     Project.Data.PCA <- princomp(predictors)
     summary(Project.Data.PCA)
    PCAFactors<-Project.Data.PCA$scores
    Rotated<-as.data.frame(cbind(Resp=test_dat$Resp,PCAFactors))
    linModPCA<-lm(Resp~.,data=Rotated)
    metrics.swiss <- calc.relimp(linModPCA, type = c("lmg", "first", "last","betasq", 
    "pratt"))
     metrics.swiss