r regression pca

Principal Component Regression? What is the dependent variable?


I am performing a PCA to try to recover the real coefficients of some highly correlated variables. I have a very large dataset but will try to simplify here. I have the formula:

lm(y~x1+x2+x3...x55) -> reg_linear_model

The issue I am having is that x1:x4 are all very highly correlated, and some of their coefficients come out negative because of this. When I run the PCA I get the list of components and their values. I would like to test which components to use, but the dependent variable y is three years of weekly data, so it is y1, y2, y3, y4, ..., y156 (156 weeks). I cannot regress y on the components because the lengths are different. Do I need to transform y in some way so that it has the same number of rows as the components? It is very hard to find an answer for this. A lot of PCR explanations just say to regress y on the components, but y is not part of the PCA.

Appreciate any help on this!


Solution

  • The usual approach looks like this. We can use the iris dataset, with Sepal.Length as the dependent variable and the other numeric columns as the independent variables.

    First of all, two of the independent variables, Petal.Width and Petal.Length, are highly correlated:

    cor(iris[,2:4])
                 Sepal.Width Petal.Length Petal.Width
    Sepal.Width    1.0000000   -0.4284401  -0.3661259
    Petal.Length  -0.4284401    1.0000000   0.9628654
    Petal.Width   -0.3661259    0.9628654   1.0000000
    

    As you describe, if we run a plain regression, one of them gets a negative coefficient:

    summary(lm(Sepal.Length ~ .,data=iris[,1:4]))
    
    Call:
    lm(formula = Sepal.Length ~ ., data = iris[, 1:4])
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.82816 -0.21989  0.01875  0.19709  0.84570 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
    Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
    Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
    Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
    

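    We run a PCA on the independent variables only and take the principal component scores, which are stored under $x. The scores have one row per observation, so they line up with the dependent variable and can be used as regressors: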

    pca=prcomp(iris[,2:4])
    cor(iris[,"Sepal.Length"],pca$x)
               PC1       PC2       PC3
    [1,] 0.8619141 -0.279587 0.1937703
    
    data = data.frame(
    Sepal.Length=iris[,"Sepal.Length"],
    pca$x)
    
    summary(lm(Sepal.Length ~ .,data=data))
    
    Call:
    lm(formula = Sepal.Length ~ ., data = data)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.82816 -0.21989  0.01875  0.19709  0.84570 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  5.84333    0.02568 227.519  < 2e-16 ***
    PC1          0.37123    0.01340  27.697  < 2e-16 ***
    PC2         -0.58457    0.06506  -8.984 1.22e-15 ***
    PC3          0.86983    0.13969   6.227 4.80e-09 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    

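    The principal components are uncorrelated with each other, so you can use them as regressors. Note that keeping all of them reproduces the original fit (the residuals in the two summaries are identical), so the benefit comes from dropping some components. If you have a lot of variables, you can also choose which components to keep by their correlation with the target variable, as in the cor() call above.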
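
To tie this back to the 156-week setup in the question, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the four predictors x1 to x4, the way y is generated, and the choice of keeping k = 2 components. It shows that the score matrix from prcomp has as many rows as y, how to regress y on only the first k components, and one way to map the component coefficients back to the original predictors through the rotation matrix.

```r
# Simulated stand-in for the asker's data: 156 weekly rows and four
# predictors, three of which are highly correlated (all values made up).
set.seed(1)
n <- 156
z <- rnorm(n)
X <- data.frame(
  x1 = z + rnorm(n, sd = 0.1),
  x2 = z + rnorm(n, sd = 0.1),
  x3 = z + rnorm(n, sd = 0.1),
  x4 = rnorm(n)
)
y <- 2 * z + X$x4 + rnorm(n)

# PCA on the predictors only; y is deliberately left out.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# One row of scores per observation, so the scores line up with y.
dim(pca$x)   # 156 4
length(y)    # 156

# Regress y on the first k components (k = 2 is an arbitrary choice here).
k <- 2
pcr_data <- data.frame(y = y, pca$x[, 1:k, drop = FALSE])
pcr_fit <- lm(y ~ ., data = pcr_data)

# Map the component coefficients back to the predictors:
# first on the standardised scale, then per unit of the original variables.
gamma <- coef(pcr_fit)[-1]
beta_scaled <- pca$rotation[, 1:k, drop = FALSE] %*% gamma
beta <- beta_scaled / pca$scale
```

If the lengths still differ on the real data, it is worth checking that the matrix passed to prcomp has one row per week (156 rows) rather than one row per variable, and that no rows were dropped because of missing values.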