python scikit-learn sparse-matrix nearest-neighbor

kNN with big sparse matrices in Python

I have two large sparse matrices:

In [3]: trainX
Out[3]: 
<6034195x755258 sparse matrix of type '<type 'numpy.float64'>'
        with 286674296 stored elements in Compressed Sparse Row format>

In [4]: testX
Out[4]: 
<2013337x755258 sparse matrix of type '<type 'numpy.float64'>'
        with 95423596 stored elements in Compressed Sparse Row format>

About 5 GB RAM in total to load. Note these matrices are HIGHLY sparse (0.0062% occupied).

For each row in testX, I want to find the Nearest Neighbor in trainX and return its corresponding label, found in trainY. trainY is a list with the same length as trainX and has many many classes. (A class is made up of 1-5 separate labels, each label is one of 20,000, but the number of classes is not relevant to what I am trying to do right now.)

I am using sklearn's KNN algorithm to do this:

from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(trainX, trainY)
clf.predict(testX[0])

Even predicting for 1 item of testX takes a while (i.e. something like 30-60 secs, but if you multiply by 2 million, it becomes pretty much impossible). My laptop with 16GB of RAM starts to swap a bit, but does manage to complete for 1 item in testX.

My questions is, how can I do this so it will finish in reasonable time? Say one night on a large EC2 instance? Would just having more RAM and preventing the swapping speed it up enough (my guess is no). Maybe I can somehow make use of the sparsity to speed up the calculation?

Thank you.

Solution

Classic kNN data structures such as the KD tree used in sklearn become very slow when the dimension of the data increases. For very high-dimensional problems it is advisable to switch algorithm class and use approximate nearest neighbour (ANN) methods, which sklearn seems to be lacking, unfortunately. See links below for papers on algorithms and theory why approximate nearest neighbors is so much faster in these cases.

A well-known ANN library in the C++ world, widely used in Computer Vision for nearest neighbors in feature descriptor spaces, is FLANN. The homepage says it contains Python bindings (I have never worked with then).
Another popular alternative is the ANN library with Python wrapper here, although the newer FLANN seems to be more popular at the moment.
See also this answer (but some links are dead).

One caveat: Your data seems to be very high dimensional - I don't known how these libraries perform for you. They should still beat sklearn.