Search code examples
algorithmnumpypcadata-analysis

Why does principal component analysis give me drastically different results?


The problem I am trying to solve is like this: given a blob in an image, I want to get its orientation to guide me draw lines to fill the area. I want the lines to follow the long axis of the area, to use as few lines as possible.

I googled around, and come across PCA (principal component analysis) as a good method to get orientation of blob, by feeding PCA algorithm all the point coordinates:

https://alyssaq.github.io/alyssaq.github.io/2015/computing-the-axes-or-orientation-of-a-blob/index.html

But copying the exact algorithm, I get very surprising results. Given similar shaped areas, the PCA algorithm returns completely different eigenvectors. They seem perpendicular:

Different directions

The lines above are rendered by following the slope given by PCA algorithm, with slight random variation.

I don't know much about data processing, what is happening here? How should I address this problem?

Code:

import numpy as np

# I tried passing different set of points to pca:
# 1. Only points at the perimeter of the area
# 2. Random sample of points within the area
# 3. All points in the area

# points are like [(x1, y1), (x2, y2), ... ]
def pca(points):
    xs, ys = zip(*points)
    x = np.array(xs)
    y = np.array(ys)

    # Subtract mean from each dimension. We now have our 2xm matrix.
    x = x - np.mean(x)
    y = y - np.mean(y)
    coords = np.vstack([x, y])

    # Covariance matrix and its eigenvectors and eigenvalues
    cov = np.cov(coords)
    vals, evecs = np.linalg.eig(cov)

    # Sort eigenvalues in decreasing order (we only have 2 values)
    sort_indices = np.argsort(vals)[::-1]

    evec1, evec2 = evecs[:, sort_indices]  # Eigenvector with largest eigenvalue
    eval1, eval2 = vals[sort_indices[0]], vals[sort_indices[1]]
    print("PCA:", evec1, evec2, eval1, eval2)
    return evec1, evec2, eval1, eval2

I tried passing different set of points to pca:

  1. Only points at the perimeter of the area
  2. Random sample of points within the area
  3. All points in the area

But it doesn't matter, with each of the point selection, my algorithm can produce the above 2 patterns. Although it seems the one on the right (the incorrect one) are produced more often.


Solution

  • The mistake is in this line:

    evec1, evec2 = evecs[:, sort_indices]
    

    The eigenvectors are in the columns, but that assignment assigns the first row of evecs[:, sort_indices] to evec1 and the second row to evec2. A quick fix is

    evec1, evec2 = evecs[:, sort_indices].T