I am performing a PCA to try to recover the real coefficients of some highly correlated variables. I have a very large dataset but will try to simplify here. I have the formula:
lm(y~x1+x2+x3...x55) -> reg_linear_model
The issue I am having is that x1 to x4 are all very highly correlated, and some of their coefficients come out negative because of this. When I perform the PCA I get the list of components and their values. I would like to test which components to use, but the dependent variable Y is three years of data broken up by week, so it is y1, y2, y3, y4, ..., y156 (156 weeks). I cannot regress Y on the components because the lengths are different. Do I need to transform Y in some way so that it has the same number of rows as the components? It is very hard to find an answer for this. A lot of PCR explanations just say to regress y on the components, but Y is not in the PCA.
Appreciate any help on this!
Usually you do it like this. We can use the iris dataset, with Sepal.Length as the dependent variable and the other numeric columns as the independent variables.
First of all, there is a strong correlation between two of the independent variables, Petal.Width and Petal.Length:
cor(iris[,2:4])
             Sepal.Width Petal.Length Petal.Width
Sepal.Width    1.0000000   -0.4284401  -0.3661259
Petal.Length  -0.4284401    1.0000000   0.9628654
Petal.Width   -0.3661259    0.9628654   1.0000000
As you said, if we run the regression, one of the correlated coefficients comes out negative:
summary(lm(Sepal.Length ~ .,data=iris[,1:4]))
Call:
lm(formula = Sepal.Length ~ ., data = iris[, 1:4])
Residuals:
     Min       1Q   Median       3Q      Max
-0.82816 -0.21989  0.01875  0.19709  0.84570
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
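We run a PCA on the independent variables only and take the principal component scores, which are stored under $x. The scores have one row per observation, so they line up row for row with the dependent variable: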
pca=prcomp(iris[,2:4])
cor(iris[,"Sepal.Length"],pca$x)
           PC1       PC2       PC3
[1,] 0.8619141 -0.279587 0.1937703
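As a quick check on the length question, here is a minimal sketch on the same iris data. It confirms that the scores in pca$x have as many rows as the dependent variable and that the components are uncorrelated with each other. The variable names (n_scores, n_y, max_off_diag) are just illustrative.

```r
# PCA on the three predictors only; the dependent variable is left out
pca <- prcomp(iris[, 2:4])

# One row of scores per observation, same length as the dependent variable
n_scores <- nrow(pca$x)
n_y <- length(iris[, "Sepal.Length"])
stopifnot(n_scores == n_y)

# Off-diagonal correlations between components are zero up to rounding error
pc_cor <- cor(pca$x)
max_off_diag <- max(abs(pc_cor[upper.tri(pc_cor)]))
print(max_off_diag)
```

In the question's setting, assuming the predictors are stored with one week per row (a 156 x 55 matrix), pca$x would have 156 rows and match the 156 weekly values of y. If the lengths differ, it may be worth checking whether the PCA was run on a transposed or differently filtered matrix.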
data = data.frame(
Sepal.Length=iris[,"Sepal.Length"],
pca$x)
summary(lm(Sepal.Length ~ .,data=data))
Call:
lm(formula = Sepal.Length ~ ., data = data)
Residuals:
     Min       1Q   Median       3Q      Max
-0.82816 -0.21989  0.01875  0.19709  0.84570
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.84333    0.02568 227.519  < 2e-16 ***
PC1          0.37123    0.01340  27.697  < 2e-16 ***
PC2         -0.58457    0.06506  -8.984 1.22e-15 ***
PC3          0.86983    0.13969   6.227 4.80e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The principal components are uncorrelated with each other, so you can use them as regressors. If you have a lot of variables, you can also choose which components to keep by their correlation with the target variable, as above.
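As a rough sketch of what choosing components could look like, the code below keeps only the first k components and maps the fitted coefficients back to the original variables through pca$rotation. The value k = 2 is an arbitrary choice for illustration, not a recommendation, and the object names (scores, fit_k, beta_original and so on) are hypothetical.

```r
pca <- prcomp(iris[, 2:4])
y <- iris[, "Sepal.Length"]

# Keep the first k components (k = 2 is an arbitrary, illustrative choice)
k <- 2
scores <- pca$x[, 1:k, drop = FALSE]
fit_k <- lm(y ~ scores)

# Scores are (centered X) %*% rotation, so the implied coefficients on the
# original variables are rotation %*% (component coefficients)
gamma <- coef(fit_k)[-1]
beta_original <- pca$rotation[, 1:k, drop = FALSE] %*% gamma
print(beta_original)

# Sanity check: with all components kept, the back-transformed coefficients
# match the ordinary regression on the original variables
fit_all <- lm(y ~ pca$x)
beta_all <- drop(pca$rotation %*% coef(fit_all)[-1])
beta_lm <- coef(lm(Sepal.Length ~ ., data = iris[, 1:4]))[-1]
print(all.equal(unname(beta_all), unname(beta_lm)))
```

Keeping every component reproduces the original fit, which is why the residuals in the two summaries above are identical. Any change in the coefficients only comes from dropping components. If the predictors are on very different scales, passing scale. = TRUE to prcomp is a common choice, in which case the back-transformed coefficients refer to the standardized variables.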