I am trying to learn principal component regression (pcr) with Matlab. I use this guide here: http://www.mathworks.fr/help/stats/examples/partial-least-squares-regression-and-principal-components-regression.html
it's really good, but I just cannot understand one step:
we do the PCA and the regression, nice and clear:
[PCALoadings,PCAScores,PCAVar] = princomp(X);
betaPCR = regress(y-mean(y), PCAScores(:,1:2));
And then we adjust the first coefficient:
betaPCR = PCALoadings(:,1:2)*betaPCR;
betaPCR = [mean(y) - mean(X)*betaPCR; betaPCR];
yfitPCR = [ones(n,1) X]*betaPCR;
How come that the coefficient needs to be 'mean(y) - mean(X)*betaPCR'
for the constant one factor? Can you explain that to me?
Thanks in advance!
This is really a math question, not a coding question. Your PCA extracts a set of features and puts them in a matrix, which gives you PCALoadings
and PCAScores
. Pull out the first two principal components and their loadings, and put them in their own matrix:
W = PCALoadings(:, 1:2)
Z = PCAScores(:, 1:2)
The relationship between X
and Z
is that X
can be approximated by:
Z = (X - mean(X)) * W <=> X ~ mean(X) + Z * W' (1)
The intuition is that Z
captures most of the "important information" in X
, and the matrix W
tells you how to transform between the two representations.
Now you can do a regression of y
on Z
. First you have to subtract the mean from y
, so that both the left and right hand sides have mean zero:
y - mean(y) = Z * beta + errors (2)
Now you want to use that regression to make predictions for y
from X
. Substituting from equation (1) into equation (2) gives you
y - mean(y) = (X - mean(X)) * W * beta
= (X - mean(X)) * beta1
where we have defined beta1 = W * beta
(you do this in your third line of code). Rearranging:
y = mean(y) - mean(X) * beta1 + X * beta1
= [ones(n,1) X] * [mean(y) - mean(X) * beta1; beta1]
= [ones(n,1) X] * betaPCR
which works out if we define
betaPCR = [mean(y) - mean(X) * beta1; beta1]
as in your fourth line of code.