Classifying data by distance from precomputed centroids in Python

I'm building an image classifier in Python that decides whether an image contains a car.

Here are my steps:

1. Get SIFT descriptors from about 200 images that contain cars.
2. Run the k-means algorithm on all of those SIFT descriptors to find about 50 centroids.
3. Using those centroids and new images, generate training data for an SVM.

I want to compute those k-means centroids only once and then save them to a file for reuse.

My problem is the following:

I have 50 precalculated centroids and a new image with its SIFT descriptors. For each descriptor, I want to find the nearest centroid.

For example: centroid 1 is the nearest centroid for 5 descriptors, centroid 2 is the nearest for 12 descriptors, and so on. I will then feed those counts to the SVM.

This is like kmeans.predict(), but I don't want to recompute k-means every time I add a new image.

So is there a function in Python that takes 50 points (the centroids) and N points in the same space, and returns how those N points are distributed among their nearest centroids?

Thanks

Solution

Have a look at the article about model persistence in the scikit-learn documentation: http://scikit-learn.org/stable/modules/model_persistence.html

Save your model using pickle:

import pickle
# Pickle writes bytes, so open the file in binary mode ('wb').
with open('kmeans.dat', 'wb') as f:
    pickle.dump(kmeans, f)

Later you can load it again by using:

# Read the pickle back in binary mode ('rb').
with open('kmeans.dat', 'rb') as f:
    kmeans = pickle.load(f)
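
Once the model is loaded, calling predict() only assigns each point to its nearest existing centroid; it does not fit k-means again. Here is a rough sketch of the whole round trip, ending with the per-centroid counts described in the question. The random arrays only stand in for real SIFT descriptors, and the temporary file path and variable names are made up for illustration:

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in data: random 128-dimensional vectors instead of real SIFT descriptors.
train_descriptors = rng.random((500, 128))

# Fit once (this is the expensive step) and save the fitted model.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(train_descriptors)
model_path = Path(tempfile.mkdtemp()) / "kmeans.dat"  # hypothetical location
with open(model_path, "wb") as f:
    pickle.dump(kmeans, f)

# Later: load the model and reuse its centroids without refitting.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

# Descriptors of a "new image" (again random stand-ins).
new_descriptors = rng.random((37, 128))

# predict() returns the index of the nearest centroid for each descriptor.
labels = loaded.predict(new_descriptors)

# Count how many descriptors fall to each centroid; minlength keeps
# centroids that received no descriptors in the histogram as zeros.
histogram = np.bincount(labels, minlength=loaded.n_clusters)
```

The resulting histogram has one entry per centroid and could serve as the feature vector for the SVM.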

Note that a pickled model is only reliably loadable with the same Python and scikit-learn versions that were used to save it.
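
If depending on pickle compatibility is a concern, one alternative is to store only the centroid array (kmeans.cluster_centers_ on a fitted scikit-learn model) with NumPy and do the nearest-centroid lookup yourself. The sketch below assumes plain Euclidean distance; centroid_histogram and the toy 2-D points are hypothetical names and data chosen for illustration, not part of any library:

```python
import tempfile
from pathlib import Path

import numpy as np


def centroid_histogram(descriptors, centroids):
    """Count how many descriptors have each centroid as their nearest one."""
    # Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2,
    # giving an array of shape (n_descriptors, n_centroids).
    d2 = (
        (descriptors ** 2).sum(axis=1)[:, None]
        - 2.0 * descriptors @ centroids.T
        + (centroids ** 2).sum(axis=1)[None, :]
    )
    nearest = d2.argmin(axis=1)
    return np.bincount(nearest, minlength=len(centroids))


# Toy 2-D centroids; in practice this would be kmeans.cluster_centers_.
centroids = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])

# Save the centroids once and load them again later (hypothetical path).
centroid_path = Path(tempfile.mkdtemp()) / "centroids.npy"
np.save(centroid_path, centroids)
loaded_centroids = np.load(centroid_path)

# Toy "descriptors" of a new image.
descriptors = np.array([[1.0, 1.0], [9.0, 9.0], [0.5, 0.0], [1.0, 9.0]])

counts = centroid_histogram(descriptors, loaded_centroids)
print(counts)  # [2 1 1]
```

For 50 centroids and a few thousand descriptors per image the full distance matrix is small, so this brute-force approach should be adequate; for much larger inputs a spatial index might be worth considering.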