python, arrays, numpy, pca

PCA for KNN in numpy


I've been tasked with using my own PCA implementation to reduce the data to two dimensions for a KNN assignment. My PCA code returns an array of eigenvectors called PCevecs.

def __PCA(data):
    # Normalize data
    data_cent = data - np.mean(data)

    # Calculate covariance
    covarianceMatrix = np.cov(data_cent, bias=True)

    # Find eigenvectors and eigenvalues
    eigenvalue, eigenvector = np.linalg.eigh(covarianceMatrix)

    # Sort the eigenvectors and eigenvalues
    PCevals = eigenvalue[::-1]
    PCevecs = eigenvector[:, ::-1]

    return PCevals, PCevecs

The assignment transforms the training data using the PCA. The returned PCevecs has shape (88, 88), as shown by calling print(PCevecs.shape). The shape of the training data is (88, 4).

np.dot(trainingFeatures, PCevecs[:, 0:2])

When the code runs I get the error message "ValueError: shapes (88,4) and (88,2) not aligned: 4 (dim 1) != 88 (dim 0)". I can see that the arrays don't match, but I can't see what I've done wrong in the PCA implementation. I've looked at similar problems on Stack Overflow, but I haven't seen anyone sort the eigenvectors and eigenvalues the way I do.


Solution

  • (EDITED with additional info from the comments)

    While the PCA implementation is OK in general, you either want to compute it on the transposed data, or you want to make sure you tell np.cov() which axis holds your variables via the rowvar parameter. By default (rowvar=True), np.cov() treats each row as a variable, which is why 88 samples produced an (88, 88) matrix, as the quick check below illustrates.
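
    For example, with shapes matching the question's training data (X here is just illustrative random data):

    import numpy as np

    X = np.random.rand(88, 4)  # 88 samples, 4 features
    print(np.cov(X, bias=True).shape)                # (88, 88): rows treated as variables
    print(np.cov(X, rowvar=False, bias=True).shape)  # (4, 4): columns treated as variables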

    The following would work as you expect:

    import numpy as np


    def __PCA_fixed(data, rowvar=False):
        # Center each feature (column-wise mean; note that np.cov()
        # also centers internally, so this does not affect the covariance)
        data_cent = data - np.mean(data, axis=0)

        # Calculate covariance (pass `rowvar` through to `np.cov()`)
        covarianceMatrix = np.cov(data_cent, rowvar=rowvar, bias=True)

        # Find eigenvectors and eigenvalues
        eigenvalue, eigenvector = np.linalg.eigh(covarianceMatrix)

        # `np.linalg.eigh()` returns eigenvalues in ascending order,
        # so reverse both to sort by decreasing eigenvalue
        PCevals = eigenvalue[::-1]
        PCevecs = eigenvector[:, ::-1]

        return PCevals, PCevecs


    Testing it out with some random numbers:

    data = np.random.randint(0, 100, (100, 10))
    PCevals, PCevecs = __PCA_fixed(data)
    print(PCevecs.shape)
    # (10, 10)
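
    To then reduce the data to two dimensions for KNN, as in the question, project it onto the first two eigenvectors (centering the data the same way first):

    data_cent = data - np.mean(data, axis=0)
    data_2d = np.dot(data_cent, PCevecs[:, 0:2])
    print(data_2d.shape)
    # (100, 2)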
    

    Also note that, in more general terms, the singular value decomposition (np.linalg.svd() in NumPy) can be a better approach to principal component analysis: it avoids forming the covariance matrix explicitly, and it has a simple relationship with the eigendecomposition used here (the right singular vectors are the covariance eigenvectors, and the squared singular values are proportional to the eigenvalues).
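
    A minimal sketch of that approach (the name __PCA_svd is just for illustration; it assumes samples in rows, matching rowvar=False above):

    def __PCA_svd(data):
        # Center each feature (column-wise mean)
        data_cent = data - np.mean(data, axis=0)

        # Economy-size SVD; the rows of `vt` are the principal directions,
        # already sorted by decreasing singular value
        u, s, vt = np.linalg.svd(data_cent, full_matrices=False)

        # With the same `bias=True` normalization as above, the
        # covariance eigenvalues are the squared singular values over n
        PCevals = s ** 2 / data.shape[0]
        PCevecs = vt.T
        return PCevals, PCevecs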


    As a general coding style note, it may be a good idea to follow the advice of PEP 8, much of which can be checked automatically by a tool such as autopep8.