Sklearn PCA returning an array with only one value, when given an array of hundreds

I wrote a program intended to classify an image by similarity:

for i in g:
    fulFi = i

    tiva = []
    tivb = []

    a = cv2.imread(i)
    b = cv2.resize(a, (500, 500))

    img2 = flatten_image(b)
    tivb.append(img2)
    cb = np.array(tivb)
    iab = trueArray(cb)

    print "Image:                      " + (str(i)).split("/")[-1]
    print "Image Size                  " + str(len(iab))
    print "Image Data:                 " + str(iab) + "\n"



pca = RandomizedPCA(n_components=2)
X = pca.fit_transform(iab)
Xy = pca.transform(X)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, Xy.ravel())

def aip(img):
    a = cv2.imread(img)
    b = cv2.resize(a, (500, 500))

    tivb = []

    r = flatten_image(b)
    tivb.append(r)
    o = np.array(tivb)
    l = trueArray(o)

    print "Test Image:                 " + (str(img)).split("/")[-1]
    print "Test Image Size             " + str(len(l))
    print "Test Image Data:            " + str(l) + "\n"

    return l


testIm = aip(sys.argv[2])
b = pca.fit_transform(testIm)
print         "KNN Prediction:             " + str(knn.predict(b))

It runs without raising any errors, but it gives me the exact same value regardless of the image used:

Image:                      150119131035-green-bay-seattle-nfl-1024x576.jpg
Image Size                  750000
Image Data:                 [255 242 242 ..., 148 204 191]

Test Image:                 agun.jpg
Test Image Size             750000
Test Image Data:            [216 255 253 ..., 205 225 242]

KNN Prediction:             [-255.]

and

Image:                      150119131035-green-bay-seattle-nfl-1024x576.jpg
Image Size                  750000
Image Data:                 [255 242 242 ..., 148 204 191]

Test Image:                 bliss.jpg
Test Image Size             750000
Test Image Data:            [243 240 232 ...,  13  69  48]

KNN Prediction:             [-255.]

The KNN prediction is always -255, no matter which image is used. After investigating further, I found that the problem was my PCA: for some reason, it was taking an array with 750000 values and returning an array with only one:

pca = RandomizedPCA(n_components=2)
X = pca.fit_transform(iab)
Xy = pca.transform(X)

print "Iab:                        " + str(iab)
print "Iab Type:                   " + str(type(iab))
print "Iab length:                 " + str(len(iab))



print "X Type:                     " + str(type(X))
print "X length:                   " + str(len(X))
print "X:                          " + str(X)


print "Xy Type:                    " + str(type(Xy))
print "Xy Length:                  " + str(len(Xy))
print "Xy:                         " + str(Xy)

gives this:

Image:                      150119131035-green-bay-seattle-nfl-1024x576.jpg
Image Size                  750000
Image Data:                 [255 242 242 ..., 148 204 191]

Iab:                        [255 242 242 ..., 148 204 191]
Iab Type:                   <type 'numpy.ndarray'>
Iab length:                 750000
X Type:                     <type 'numpy.ndarray'>
X length:                   1
X:                          [[ 0.]]
Xy Type:                    <type 'numpy.ndarray'>
Xy Length:                  1
Xy:                         [[-255.]]

My question is why? X and Xy should both have hundreds of values, not just one. The tutorial I followed didn't have an explanation, and the documentation only says that there needs to be the same array format for both the transform and the fit_transform. How should I be approaching this?

Solution

What you are doing with X = pca.fit_transform(iab) and Xy = pca.transform(X) is wrong.

You are losing the iab variable for the two images. You need the flattened arrays of both images to be available outside of your for loop. However, each iteration overwrites iab, so only the last image survives.
Even if you saved the two arrays separately, say as iab[0] and iab[1], you would still need to fit PCA on both and then represent both images along the transformed axes. You also need to decide which data to use to learn the transformation.

Here is sample code:

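import numpy as np
# RandomizedPCA is gone from newer scikit-learn releases;
# PCA(n_components=2, svd_solver='randomized') is the replacement there
from sklearn.decomposition import RandomizedPCA

# First initialize the PCA with the desired number of components
pca = RandomizedPCA(n_components=2)

# Next, fit on a 2D matrix with one row per flattened image to learn the transformation
data = np.vstack((iab[0].reshape(1, -1), iab[1].reshape(1, -1)))
pca.fit(data)

# Finally, apply the learned transformation; X[0] and X[1] are the two transformed images
X = pca.transform(data)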
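
To have iab[0] and iab[1] available in the first place, the loop has to collect every flattened image instead of overwriting one variable. Here is a minimal sketch of that pattern. The small random arrays stand in for what cv2.imread and cv2.resize would return, and flatten_image here is only a guess at what your helper does, so treat both as assumptions:

```python
import numpy as np

def flatten_image(img):
    # Hypothetical stand-in for the question's helper: one image -> one 1D vector.
    return np.asarray(img).ravel()

# Assumption: two synthetic 4x4 RGB arrays replace the images read from disk.
rng = np.random.RandomState(0)
images = [rng.randint(0, 256, size=(4, 4, 3)) for _ in range(2)]

rows = []                              # created once, before the loop
for img in images:
    rows.append(flatten_image(img))    # keep every image instead of overwriting

iab = np.vstack(rows)                  # 2D matrix: one row per image
print(iab.shape)
```

After the loop, iab[0] and iab[1] are the two flattened images, and iab as a whole is already in the row-per-image layout that PCA expects.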

You basically learn PCA on a matrix in which each row represents one image. What you want to do is identify which pixels (or combinations of pixels) best describe the images. For this you need to input many images, and find which pixels differentiate between them better than others. In your use of fit, you passed a single 1D array of 750000 values rather than a matrix with one row per image. Judging by your output (X has length 1), it appears to have been treated as a single sample, and a single sample gives PCA no variation to learn from.
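
To make the row-per-image idea concrete, here is a rough sketch that does the PCA steps by hand with NumPy's SVD rather than with RandomizedPCA, so the shapes are visible. The data is random and the sizes (6 images, 300 pixels) are arbitrary assumptions. The last part shows why a single image cannot work: once the mean is subtracted, nothing is left.

```python
import numpy as np

rng = np.random.RandomState(0)

# Assumption: 6 synthetic "images", each already flattened to 300 pixel values.
data = rng.rand(6, 300)                    # rows = images, columns = pixels

# PCA by hand: center each pixel column, then take the top right-singular vectors.
mean = data.mean(axis=0)
centered = data - mean
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt[:2]                        # 2 principal axes, shape (2, 300)
X = np.dot(centered, components.T)         # each image as 2 numbers, shape (6, 2)
print(X.shape)

# With a single image there is nothing to compare against:
# subtracting the mean leaves a row of zeros, so every projection is 0.
single = data[:1]
single_centered = single - single.mean(axis=0)
print(np.abs(single_centered).max())
```

That all-zero row is consistent with the X: [[ 0.]] in your output.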

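Also, in your case you combined fit() and transform() by calling fit_transform(), which is perfectly valid as long as you understand what it represents. Regardless, you never transformed the second image with the same learned transformation. Note that calling pca.fit_transform(testIm) on the test image refits the PCA instead of reusing what was learned earlier; use pca.transform there.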

If you want to know more about how PCA works you can read this answer.

Finally, you cannot learn a KNN classifier on 1 training sample and 1 testing sample! Learning algorithms are meant to learn from a population of input.

All you seem to need is basic distance between the two. You need to pick a distance metric. If you choose to use Euclidean distance (also called the L2 norm), then here is the code for it:

dist = numpy.linalg.norm(X[0]-X[1])

You can also do this instead:

from scipy.spatial import distance
dist = distance.euclidean(X[0], X[1])
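
If you later have more than two images, the same distance can be used to rank them by similarity to a query image. This is only a sketch: the coordinates below are made-up 2-component values standing in for PCA output, and gallery and query are hypothetical names.

```python
import numpy as np

# Assumption: made-up 2-component PCA coordinates for three stored images.
gallery = np.array([[0.0, 0.0],
                    [3.0, 4.0],
                    [1.0, 1.0]])
query = np.array([0.9, 1.2])               # coordinates of the image to compare

# Euclidean distance from the query to every row of the gallery.
dists = np.linalg.norm(gallery - query, axis=1)

# Indices of gallery images, most similar first.
order = np.argsort(dists)
print(order)
```

The first index in order is the stored image closest to the query under the L2 norm.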

In any case, there is no meaning in transforming the transformed data again, as you are doing with Xy = pca.transform(X). That doesn't give you a target.

You can only apply classification such as KNN when you have, say, 100 images, where 50 show a "tree" and the remaining 50 show a "car". Once you train the model on those labeled images, you can predict whether a new image is of a tree or a car.
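
As an illustration of what that looks like, here is a small sketch with synthetic 2D features and a hand-rolled nearest-neighbour lookup. It is a stand-in for KNeighborsClassifier(n_neighbors=1), written so that it runs with NumPy alone. The clusters, the labels and the predict_1nn helper are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.RandomState(0)

# Assumption: 2D features (e.g. after PCA) for 50 "tree" and 50 "car" images,
# drawn from two well-separated synthetic clusters.
trees = rng.randn(50, 2)
cars = rng.randn(50, 2) + 10.0
X_train = np.vstack((trees, cars))
y_train = np.array(["tree"] * 50 + ["car"] * 50)   # the targets are labels

def predict_1nn(x):
    # Return the label of the training sample closest to x (1-nearest neighbour).
    d = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(d)]

print(predict_1nn(np.array([0.2, -0.1])))
print(predict_1nn(np.array([10.1, 9.8])))
```

The key difference from your code is that the targets are class labels supplied by you, not the output of a second pca.transform call.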