Tags: r, regression, pca, multivariate-testing, pls

Comparing all variables in PCR function from pls package, R


I am trying to conduct a Principal Components Regression (PCR) analysis in R. Usually I would do a PCA (Principal Components Analysis); however, I have multicollinearity and have read that PCR can handle this.

I am using the pcr function from the pls package. This requires a formula identifying the variables to be compared. I want to be able to compare every variable against every other variable, the way a PCA does. However, with this function I can only figure out how to compare one variable against every other variable, and the result changes depending on which variable I choose. Of course, it is possible I am not understanding PCR correctly.

Here is an example using the iris data set.

library(pls)
library(ggplot2)

Comparing Petal.Length to all other variables:

ir.pcr <- pcr(Petal.Length ~ ., data = iris, validation = "CV")  # PCR comparing Petal.Length with all other variables

df <- data.frame(ir.pcr$scores[, 1], ir.pcr$scores[, 2])  # first two component scores from the PCR, for ggplot
colnames(df) <- c('Comp1', 'Comp2')

ggplot(data = df, aes(x = Comp1, y = Comp2)) +
  geom_point(aes(fill = iris$Species), shape = 21, colour = 'black', size = 3)  # plot points

[Plot of the first two PCR component scores (Petal.Length as response), coloured by Species]

Using Sepal.Width compared to every other variable:

ir.pcr <- pcr(Sepal.Width ~ ., data = iris, validation = "CV")  # PCR comparing Sepal.Width with all other variables

df <- data.frame(ir.pcr$scores[, 1], ir.pcr$scores[, 2])  # first two component scores from the PCR, for ggplot
colnames(df) <- c('Comp1', 'Comp2')

ggplot(data = df, aes(x = Comp1, y = Comp2)) +
  geom_point(aes(fill = iris$Species), shape = 21, colour = 'black', size = 3)  # plot points

[Plot of the first two PCR component scores (Sepal.Width as response), coloured by Species]

My understanding is that a . after ~ in a formula means 'use every other variable'. If that is so, how can I essentially write . ~ . so that every variable is compared against every other variable?


Solution

  • PCR is principal components regression. That means you have one dependent variable (on the left-hand side of ~) and many independent variables (on the right-hand side of ~), just as in linear regression.

    PCR first runs PCA on the independent variables only, and then regresses the dependent variable on the principal components from that PCA. That is why you get different results when choosing different dependent variables: each choice leaves a different set of independent variables for the PCA step (see the sketch after this answer).

    This only helps with multicollinearity among the independent variables. So the technique is useful when you want to run a linear regression but have a multicollinearity problem with the independent variables. It is not a substitute for PCA in a pure dimension-reduction task (i.e. when there is no designated dependent variable).
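To make this concrete, here is a minimal sketch (the choice of Petal.Length as the response, the three numeric predictors, and the use of two components are purely for illustration): running prcomp() on the predictors and then lm() on the resulting scores should reproduce what pcr() does, up to the sign of the components, while plain prcomp() on all four numeric columns is the tool for the response-free dimension-reduction task the question describes.

library(pls)

# Step 1: PCA on the independent variables only (pcr() centres but does
# not scale by default, so prcomp() is called the same way here)
X <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Width")]  # predictors
y <- iris$Petal.Length                                        # response
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Step 2: regress the response on the first two component scores
fit_lm <- lm(y ~ pca$x[, 1] + pca$x[, 2])
coef(fit_lm)

# The same model fitted directly with pcr(); its scores should match the
# prcomp() scores up to sign. Choosing a different response leaves a
# different set of predictors for the PCA step, which is why the score
# plots in the question change.
fit_pcr <- pcr(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width,
               data = iris, ncomp = 2)
head(fit_pcr$scores[, 1:2])
head(pca$x[, 1:2])

# If the goal is dimension reduction of all variables with no designated
# response, ordinary PCA is the appropriate tool:
ir.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
head(ir.pca$x[, 1:2])  # component scores to plot, e.g. coloured by iris$Species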