
I want to classify data by distance from centroids in Python

I'm making an image classifier that will tell if an image is a car or not, in Python.

Here are my steps:

  1. Get SIFT descriptors from about 200 images with cars in them.
  2. Run the k-means algorithm on all those SIFT descriptors to find about 50 centroids.
  3. Using those centroids and new images, generate training data for an SVM.

I want to find those k-means centroids only once and then save them to a file for reuse.

My problem is the following:

I have 50 precalculated centroids and a new image with its SIFT descriptors. I want to find the nearest centroid for each descriptor.

For example: centroid 1 is nearest to 5 descriptors, centroid 2 is nearest to 12 descriptors, and so on. Then I will feed that data to the SVM.

It is like kmeans.predict(), but I don't want to recalculate k-means every time I add a new image.

So is there a function in Python that takes 50 points (centroids) in a hyperspace and N points in the same space, and returns the distribution of those N points according to their nearest centroids?



  • Have a look at the article about model persistence in the scikit-learn documentation:

    Save your model using pickle:

    import pickle
    with open('kmeans.dat', 'wb') as f:  # pickle writes bytes, so open in binary mode
        pickle.dump(kmeans, f)

    Later you can load it again by using:

    with open('kmeans.dat', 'rb') as f:  # read back in binary mode as well
        kmeans = pickle.load(f)

    Note that a pickled model can only be loaded reliably with the same Python version that stored it; the scikit-learn documentation also advises using the same scikit-learn version.
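
To connect this back to the question: calling predict() on a fitted (or reloaded) KMeans object does not rerun the clustering. It only assigns each row to its nearest stored centroid. A minimal sketch of the whole round trip follows, assuming scikit-learn and NumPy are installed. The random arrays are stand-ins for real 128-dimensional SIFT descriptors, and the variable names and temporary file path are illustrative, not part of the original question.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Stand-in data (assumption): random vectors in place of real SIFT descriptors.
rng = np.random.default_rng(0)
train_descriptors = rng.random((2000, 128))  # descriptors from the training images
new_descriptors = rng.random((300, 128))     # descriptors from one new image

# Fit k-means once to obtain the 50 centroids.
kmeans = KMeans(n_clusters=50, n_init=1, random_state=0).fit(train_descriptors)

# Save the fitted model; pickle requires a file opened in binary mode.
path = os.path.join(tempfile.mkdtemp(), "kmeans.dat")
with open(path, "wb") as f:
    pickle.dump(kmeans, f)

# Later: load the model and reuse it without refitting.
with open(path, "rb") as f:
    loaded = pickle.load(f)

# predict() returns the index of the nearest centroid for each descriptor.
labels = loaded.predict(new_descriptors)

# Count how many descriptors fall on each centroid (length-50 histogram).
histogram = np.bincount(labels, minlength=loaded.n_clusters)
```

The resulting histogram has one entry per centroid and can serve as the feature vector for the SVM. Passing minlength keeps the vector at length 50 even when some centroids attract no descriptors from a given image.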