apache-spark, pyspark, distributed-computing, sift, bigdata

How can I cluster SIFT descriptors with Apache Spark kmeans (via pickle or not)


Using OpenCV 3.1 I've calculated the SIFT descriptors for a batch of images. Each descriptor array has shape (x, 128), and I've used the pickle-based .tofile function to write each one to disk. In a sample of the images, x is between 2000 and 3000.
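For context, the extraction step looks roughly like the sketch below; the file names are placeholders, pickle.dump stands in for whatever pickle-based writer was actually used, and SIFT lives in cv2.xfeatures2d in OpenCV 3.1 with the contrib modules.

    import pickle
    import cv2  # OpenCV 3.1 with the contrib (xfeatures2d) modules

    sift = cv2.xfeatures2d.SIFT_create()

    def save_descriptors(image_path, out_path):
        # descriptors is a numpy array of shape (x, 128): one 128-dim row per keypoint
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        keypoints, descriptors = sift.detectAndCompute(img, None)
        with open(out_path, 'wb') as f:
            pickle.dump(descriptors, f)

    save_descriptors('image_0001.jpg', 'image_0001.pkl')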

I'm hoping to make use of Apache Spark's kmeans clustering via pyspark, but my question has two parts:

  1. Is pickling the best way to transfer the descriptor data?
  2. How do I get from the bunch of pickle files to a cluster-ready dataset, and what pitfalls should I be aware of (Spark, pickling, SIFT)?

My interest is in what the sequence would look like in Python 2 code, assuming there is some common storage between the descriptor generation code and the clustering environment.


Solution

  • Is pickling the best way to transfer the descriptor data?

    "Best" is very subjective here. You could try pickle or protobuf.

    How do I get from the bunch of pickle files to a cluster-ready dataset?

    1. Deserialize your data.
    2. Create an RDD that will hold the vectors (i.e. every element of the RDD will be one feature, a 128-dimensional vector).
    3. Cache the RDD, since kMeans will use it again and again.
    4. Train the kMeans model to get your clusters (a rough end-to-end sketch of all the steps follows at the end of this answer).

    For example, the LOPQ guys do:

    from pyspark.mllib.clustering import KMeans

    C0 = KMeans.train(first, V, initializationMode='random', maxIterations=10, seed=seed)
    

    where first is the RDD I am mentioning, V is the number of clusters, and C0 is the computed clustering model (check it at line 67 in GitHub).

    5. Unpersist your RDD.
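    Putting the five steps together, a rough end-to-end sketch might look like the following; it assumes the descriptor files are pickled numpy arrays of shape (x, 128), and the path pattern, k=100, and the other parameters are just placeholder values.

    import glob
    import pickle

    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName='sift-kmeans')

    # 1. Deserialize: load every pickled (x, 128) descriptor array from shared storage
    #    (the path pattern is a placeholder for wherever the files actually live).
    arrays = [pickle.load(open(p, 'rb')) for p in glob.glob('/shared/descriptors/*.pkl')]

    # 2. Create the RDD: one element per SIFT descriptor, i.e. a 128-dimensional vector.
    descriptors = np.vstack(arrays)
    rdd = sc.parallelize(descriptors.tolist(), numSlices=64)  # numSlices is arbitrary

    # 3. Cache the RDD, since kMeans iterates over it repeatedly.
    rdd.cache()

    # 4. Train the model; k=100, maxIterations and seed are arbitrary example values.
    model = KMeans.train(rdd, 100, initializationMode='random', maxIterations=10, seed=42)

    # 5. Unpersist the RDD once training is done.
    rdd.unpersist()

    print(model.clusterCenters[0])  # each centroid is a 128-dimensional vector

    Note that this sketch unpickles everything on the driver before parallelizing, which won't scale to a very large image set; in that case something like sc.binaryFiles plus a flatMap that unpickles each file would keep the deserialization on the executors instead.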