Search code examples
pythonnumpymathregressionpca

2D PCA line fitting with numpy


I'm trying to implement a 2D PCA with numpy. The code is rather simple:

import numpy as np

n=10
d=10
x=np.linspace(0,10,n)
y=x*d

covmat = np.cov([x,y])
print(covmat)

eig_values, eig_vecs = np.linalg.eig(covmat)
largest_index = np.argmax(eig_values)
largest_eig_vec = eig_vecs[largest_index]

The covariance matrix is:

[[   11.31687243   113.16872428]
 [  113.16872428  1131.6872428 ]]

Then I've got a simple helper method to plot a line (as a series of points) around a given center, in a given direction. This is meant to be used by pyplot, therefore I'm preparing separate lists for the x and y coordinate.

def plot_line(center, dir, num_steps, step_size):
    line_x = []
    line_y = []
    for i in range(num_steps):
        dist_from_center = step_size * (i - num_steps / 2)
        point_on_line = center + dist_from_center * dir
        line_x.append(point_on_line[0])
        line_y.append(point_on_line[1])
    return (line_x, line_y)

And finally the plot setup:

lines = []
mean_point=np.array([np.mean(x),np.mean(y)])
lines.append(plot_line(mean_point, largest_eig_vec, 200, 0.5))

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(x,y, c="b", marker=".", s=10
           )
for line in lines:
    ax.plot(line[0], line[1], c="r")

ax.scatter(mean_point[0], mean_point[1], c="y", marker="o", s=20)

plt.axes().set_aspect('equal', 'datalim')
plt.show()

Unfortunately, the PCA doesn't seem to work. Here's the plot:

pca line fitting

I'm afraid I've got no idea what went wrong.

  • I've computed the covariance manually -> same result.
  • I've checked the other eigenvalue -> perpendicular to the red line.
  • I've tested plot_line with the direction (1,10). It's perfectly aligned to my points: perfectly aligned

The final plot shows that the line fitted by pca is the correct result only it is mirrored at the y axis.

In fact, if I change the x coordinate of the eigenvector, the line is fitted perfectly:

perfect fit

Apparently this is a fundamental problem. Somehow I've misunderstood how to use pca.

Where is my mistake ? Online resources seem to describe PCA exactly as I implemented it. I don't believe I have to categorically mirror my line-fits at the y-axis. It's got to be something else.


Solution

  • Your mistake is that you're extracting the last row of the eigenvector array. But the eigenvectors form the columns of the eigenvector array returned by np.linalg.eig, not the rows. From the documentation:

    [...] the arrays a, w, and v satisfy the equations dot(a[:,:], v[:,i]) = w[i] * v[:,i] [for each i]

    where a is the array that np.linalg.eig was applied to, w is the 1d array of eigenvalues, and v is the 2d array of eigenvectors. So the columns v[:, i] are the eigenvectors.

    In this simple two-dimensional case, since the two eigenvectors are mutually orthogonal (because we're starting with a symmetric matrix) and unit length (because np.linalg.eig normalises them that way), the eigenvector array has one of the two forms

    [[ cos(t)  sin(t)]
     [-sin(t)  cos(t)]]
    

    or

    [[ cos(t)  sin(t)]
     [ sin(t) -cos(t)]]
    

    for some real number t, and in the first case, reading the first row (for example) instead of the first column would give [cos(t), sin(t)] in place of [cos(t), -sin(t)]. This explains the apparent reflection that you're seeing.

    Replace the line

    largest_eig_vec = eig_vecs[largest_index]
    

    with

    largest_eig_vec = eig_vecs[:, largest_index]
    

    and you should get the expected results.