My objective to fit a linear model with the same response, but predictors replaced by factors/scores. I am trying to find out which principal components to include in such a linear model if I want to achieve an R^2 of at least 0.9*r.squared
from my original model.
Which predictors should I choose?
model1 <- lm(Resp~.,data=test_dat)
> summary(model1)
Call:
lm(formula = Resp ~ ., data = test_dat)
Residuals:
Min 1Q Median 3Q Max
-0.35934 -0.07729 0.00330 0.08204 0.38709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.18858 0.06926 -46.039 <2e-16 ***
Pred1 4.32083 0.03767 114.708 <2e-16 ***
Pred2 2.42110 0.04740 51.077 <2e-16 ***
Pred3 -1.00507 0.04435 -22.664 <2e-16 ***
Pred4 -3.19480 0.09147 -34.927 <2e-16 ***
Pred5 2.77779 0.05368 51.748 <2e-16 ***
Pred6 1.22923 0.05427 22.648 <2e-16 ***
Pred7 -1.21338 0.04562 -26.595 <2e-16 ***
Pred8 0.02485 0.05937 0.419 0.676
Pred9 -0.67831 0.05308 -12.778 <2e-16 ***
Pred10 1.69947 0.02628 64.672 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1193 on 489 degrees of freedom
Multiple R-squared: 0.997, Adjusted R-squared: 0.997
F-statistic: 1.645e+04 on 10 and 489 DF, p-value: < 2.2e-16
My new model should have an R^2 >= 0.897
(threshold<-0.9*r.sqrd)
[1] 0.8973323
metrics.swiss <- calc.relimp(model1, type = c("lmg", "first", "last","betasq", "Pratt"))
metrics.swiss
[email protected]
>Pred1 Pred2 Pred3 Pred4 Pred5 Pred6 Pred7 Pred8 Pred9 Pred10
2 8 3 6 1 10 5 4 7 9
sum(metrics.swiss@lmg)
orderComponents<-c(5,1,3,8,7,4,9,2,10,6)
PCAFactors<-Project.Data.PCA$scores
Rotated<-as.data.frame(cbind(Resp=test_dat$Resp,PCAFactors))
swissRotatedReordered<-Rotated[,c(1,orderComponents+1)]
(nestedRSquared<-sapply(2:11,function(z)
summary(lm(Resp~.,data=swissRotatedReordered[,1:z]))$r.squared))
[1] 0.001041492 0.622569992 0.689046489 0.690319839 0.715051745 0.732286987
[7] 0.742441421 0.991291253 0.995263470 0.997035905
You run a linear model on the new model with your scores. "lmg" will allow you to see which factors made the most contribution and those are the factors you should keep. In my case it was the top 3 factors
predictors <- test_dat[-1]
Project.Data.PCA <- princomp(predictors)
summary(Project.Data.PCA)
PCAFactors<-Project.Data.PCA$scores
Rotated<-as.data.frame(cbind(Resp=test_dat$Resp,PCAFactors))
linModPCA<-lm(Resp~.,data=Rotated)
metrics.swiss <- calc.relimp(linModPCA, type = c("lmg", "first", "last","betasq",
"pratt"))
metrics.swiss