I am trying to implement PCA from scratch. Following is the code:
sc = StandardScaler() #standardization
X_new = sc.fit_transform(X)
Z = np.divide(np.dot(X_new.T,X_new),X_new.shape[0]) #covariance matrix
eig_values, eig_vectors = np.linalg.eig(Z) #eigen vectors calculation
eigval_sorted = np.sort(eig_values)[::-1]
ev_index =np.argsort(eigval_sorted)[::-1]
pc = eig_vectors[:,ev_index] #eigen vectors sorts on the basis of eigen values
W = pc[:,0:2] #extracting 2 components
print(W)
and getting the following components:
[[ 0.52237162 -0.37231836]
[-0.26335492 -0.92555649]
[ 0.58125401 -0.02109478]
[ 0.56561105 -0.06541577]]
When I use the sklearn's PCA I get the following two components:
array([[ 0.52237162, -0.26335492, 0.58125401, 0.56561105],
[ 0.37231836, 0.92555649, 0.02109478, 0.06541577]])
Projection onto new feature space gives following different figures:
Where am I doing it wrong and what can be done to resolve the problem?
The result of a PCA are technically not n vectors, but a subspace of dimension n. This subspace is represented by n vectors that span that subspace.
In your case, while the vectors are different, the spanned subspace is the same, so the result of the PCA is the same.
If you want to align your solution perfectly with the sklearn solution, you need to normalise your solution in the same way. Apparently sklearn prefers positive values over negative values? You'd need to dig into their documentation.
edit: Yes, of course, what I wrote is wrong. The algorithm itself returns ordered orthonormal basis vectors. So vectors that are of length one and orthogonal to each other and they are ordered in their 'importance' to the dataset. So way more information than just the subspace.
However, if v, w, u are a solution of the PCA, so should +/- v, w, u be.
edit: It seems that np.linalg.eig has no mechanism to guarantee it will also return the same set of eigenvectors representing the eigenspace, see also here: NumPy linalg.eig
So, a new version of numpy, or just how the stars are aligned today, can change your result. Although, for a PCA it should only vary in +/-