I'm confused about sklearn's PCA (here is the documentation) and its relation to the Singular Value Decomposition (SVD).
In Wikipedia we have:
The full principal components decomposition of X can therefore be given as $T = XW$, where $W$ is a p-by-p matrix of weights whose columns are the eigenvectors of $X^T X$. The transpose of W is sometimes called the whitening or sphering transformation.
Later, where it explains the relationship with SVD, we have:
$X = U \Sigma W^T$
So I assume that the matrix $W$ embeds samples into the latent space (which makes sense given the dimensions of the matrices), and that the transform method of the PCA class in sklearn should give the same result as multiplying the observation matrix by $W$. However, I checked them and they don't match.
Is there anything I'm missing, or is there a bug in the code?
import numpy as np
from sklearn.decomposition import PCA
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
pca = PCA().fit(x)
# transformed version based on WIKI: t = x @ vh.T = u @ np.diag(s)
t_svd1 = x @ vh.T
t_svd2 = u @ np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # should be a small value, but it's not :(
print(np.abs(t_svd2-t_pca).max()) # should be a small value, but it's not :(
There is a difference between the theoretical Wikipedia description and the practical sklearn implementation, but it is not a bug, merely a stability and reproducibility enhancement.
You have pretty much nailed the exact implementation of PCA; however, in order to make the computation fully reproducible, the sklearn developers added one more constraint to their implementation. The problem stems from the sign indeterminacy of the SVD, i.e. the SVD does not have a unique solution. This can easily be seen from your equation as well: if we set $U_s = -U$ and $W_s = -W$, then $U_s$ and $W_s$ also satisfy:
$X = U_s \Sigma W_s^T$
More importantly, this also holds when switching the signs of individual columns of $U$ and $W$: if we reverse the signs of the k-th column of $U$ and the k-th column of $W$, the equality still holds. You can read more about this issue, e.g., here: https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2007/076422.pdf.
The PCA implementation deals with this problem by enforcing that the loading with the highest absolute value in each column is always positive; specifically, the helper sklearn.utils.extmath.svd_flip is used. This way, no matter which signs np.linalg.svd happens to return, the absolute loading values stay the same and the sign convention of the resulting matrices is always the same.
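If you do not want to re-implement this sign fix by hand, you can call that helper directly. A minimal sketch, assuming svd_flip(u, v) with its default u-based decision (the exact default convention may differ between sklearn versions):
import numpy as np
from sklearn.utils.extmath import svd_flip
x = np.random.rand(20, 10)
x = x - x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
# make the largest-absolute entry of each column of u positive and
# flip the corresponding rows of vh in the same way
u, vh = svd_flip(u, vh)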
Thus, in order for your code to give the same result as the PCA implementation:
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(41)
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
# replicate sklearn's svd_flip: make the largest-absolute entry in each
# column of u positive and flip the matching rows of vh
max_abs_cols = np.argmax(np.abs(u), axis=0)
signs = np.sign(u[max_abs_cols, range(u.shape[1])])
u *= signs
vh *= signs.reshape(-1,1)
pca = PCA().fit(x)
# transformed version based on WIKI: t = x @ vh.T = u @ np.diag(s)
t_svd1 = x @ vh.T
t_svd2 = u @ np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # pretty small value :)
print(np.abs(t_svd2-t_pca).max()) # pretty small value :)
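As an additional sanity check (my addition, assuming the fitted PCA uses the same u-based sign convention as the manual flip above), the sign-corrected right singular vectors should also coincide with the fitted components. Continuing from the snippet above:
# vh should now match pca.components_ row by row (up to floating-point error)
print(np.abs(vh - pca.components_).max())  # pretty small value as well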