How can I reproduce scikit-learn's PCA .transform()
method using its .components_
attribute?
I thought the PCA .transform()
method maps a 3D point to a 2D point simply by applying a matrix M
to the 3D point P
, like this:
np.dot(M, P)
To check this, I wrote the following code, but its result does not match the output of the PCA .transform()
method. How should I modify the code? Am I missing something?
from sklearn.decomposition import PCA
import numpy as np
data3d = np.arange(10*3).reshape(10, 3) ** 2
pca = PCA(n_components=2)
pca.fit(data3d)
pca_transformed2d = pca.transform(data3d)
sample_index = 0
sample3d = data3d[sample_index]
# Manually transform `sample3d` to 2 dimensions.
w11, w12, w13 = pca.components_[0]
w21, w22, w23 = pca.components_[1]
my_transformed2d = np.zeros(2)
my_transformed2d[0] = w11 * sample3d[0] + w12 * sample3d[1] + w13 * sample3d[2]
my_transformed2d[1] = w21 * sample3d[0] + w22 * sample3d[1] + w23 * sample3d[2]
print("================ Validation ================")
print("pca_transformed2d:", pca_transformed2d[sample_index])
print("my_transformed2d:", my_transformed2d)
if np.all(my_transformed2d == pca_transformed2d[sample_index]):
print("My transformation is correct!")
else:
print("My transformation is not correct...")
Output:
================ Validation ================
pca_transformed2d: [-492.36557212 12.28386702]
my_transformed2d: [ 3.03163093 -2.67255444]
My transformation is not correct...
PCA begins by centering the data: subtracting the mean of all observations from each sample. Here, centering is done with
centered_data = data3d - data3d.mean(axis=0)
Taking the mean along axis=0 collapses the ten rows into a single row holding the mean of each of the three columns. After centering, project the data onto the principal components; instead of writing the matrix multiplication out by hand, use .dot
:
my_transformed2d = pca.components_.dot(centered_data[sample_index])
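Incidentally, a fitted PCA object already stores the mean it subtracted internally, in its mean_ attribute, so you don't have to recompute it. A quick sketch (reusing the names from the question) checking that centering with pca.mean_ and projecting reproduces .transform() for all samples at once:

```python
import numpy as np
from sklearn.decomposition import PCA

data3d = np.arange(10 * 3).reshape(10, 3) ** 2
pca = PCA(n_components=2)
pca.fit(data3d)

# `pca.mean_` is the per-column mean that `fit` subtracted internally.
assert np.allclose(pca.mean_, data3d.mean(axis=0))

# Center, then project every sample onto the principal components.
centered = data3d - pca.mean_
manual = centered.dot(pca.components_.T)

assert np.allclose(manual, pca.transform(data3d))
print("Manual projection matches pca.transform for all samples.")
```

Using pca.mean_ also generalizes correctly to new data: you must subtract the training mean, not the mean of the new samples.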
Finally, verification. Don't use ==
between floating-point numbers; exact equality is rare there. Tiny discrepancies appear whenever the order of operations differs: for example,
0.1 + 0.2 - 0.3 == 0.1 - 0.3 + 0.2
is False, because each ordering rounds differently. This is why NumPy provides np.allclose
, which checks that two values are equal within a small tolerance.
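A tiny demonstration of both points:

```python
import numpy as np

a = 0.1 + 0.2 - 0.3
b = 0.1 - 0.3 + 0.2
print(a == b)             # False: the two orderings round differently
print(np.allclose(a, b))  # True: equal within the default tolerance
```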
if np.allclose(my_transformed2d, pca_transformed2d[sample_index]):
print("My transformation is correct!")
else:
print("My transformation is not correct...")
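Putting the pieces together, the question's script with the two fixes applied (center before projecting, and compare with np.allclose) looks like this:

```python
import numpy as np
from sklearn.decomposition import PCA

data3d = np.arange(10 * 3).reshape(10, 3) ** 2
pca = PCA(n_components=2)
pca.fit(data3d)
pca_transformed2d = pca.transform(data3d)

sample_index = 0
# Fix 1: subtract the mean of the training data before projecting.
centered_data = data3d - data3d.mean(axis=0)
my_transformed2d = pca.components_.dot(centered_data[sample_index])

print("pca_transformed2d:", pca_transformed2d[sample_index])
print("my_transformed2d:", my_transformed2d)

# Fix 2: compare floating-point results with a tolerance, not ==.
if np.allclose(my_transformed2d, pca_transformed2d[sample_index]):
    print("My transformation is correct!")
else:
    print("My transformation is not correct...")
```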