apache-spark, pyspark, distributed-computing, sift, bigdata

How can I cluster SIFT descriptors with Apache Spark kmeans (via pickle or not)


Using OpenCV 3.1 I've calculated the SIFT descriptors for a batch of images. Each descriptor array has shape (x, 128), and I've used the pickle-based .tofile function to write each one to disk. In a sample of the images, x is between 2000 and 3000.
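For context, the extraction step looks roughly like the sketch below; the file names are placeholders, pickle.dump stands in for whatever pickle-based writer was actually used, and SIFT lives in cv2.xfeatures2d in OpenCV 3.1 with the contrib modules.

    import pickle
    import cv2  # OpenCV 3.1 with the contrib (xfeatures2d) modules

    sift = cv2.xfeatures2d.SIFT_create()

    def save_descriptors(image_path, out_path):
        # descriptors is a numpy array of shape (x, 128): one 128-dim row per keypoint
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        keypoints, descriptors = sift.detectAndCompute(img, None)
        with open(out_path, 'wb') as f:
            pickle.dump(descriptors, f)

    save_descriptors('image_0001.jpg', 'image_0001.pkl')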

I'm hoping to make use of Apache Spark's kmeans clustering via pyspark, but my question has two parts:

  1. Is pickling the best way to transfer the descriptor data?
  2. How do I get from the bunch of pickle files to a cluster-ready dataset, and what pitfalls should I be aware of (Spark, pickling, SIFT)?

My interest is in what the sequence would look like in Python 2 code, assuming there is some common storage between the descriptor generation code and the clustering environment.


Solution

  • Is pickling the best way to transfer the descriptor data?

    "Best" is very subjective here. You could try pickle or protobuf.

    How do I get from the bunch of pickle files to a cluster-ready dataset?

    1. Deserialize your data.
    2. Create an RDD that will hold the vectors (i.e. every element of the RDD will be one feature, a 128-dimensional vector).
    3. Cache the RDD, since kMeans will use it again and again.
    4. Train the kMeans model to get your clusters (a rough end-to-end sketch of all the steps follows at the end of this answer).

    For example, the LOPQ guys do:

    from pyspark.mllib.clustering import KMeans

    C0 = KMeans.train(first, V, initializationMode='random', maxIterations=10, seed=seed)
    

    where first is the RDD I am mentioning, V is the number of clusters, and C0 is the computed clustering model (check it at line 67 in GitHub).

    5. Unpersist your RDD.
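    Putting the five steps together, a rough end-to-end sketch might look like the following; it assumes the descriptor files are pickled numpy arrays of shape (x, 128), and the path pattern, k=100, and the other parameters are just placeholder values.

    import glob
    import pickle

    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName='sift-kmeans')

    # 1. Deserialize: load every pickled (x, 128) descriptor array from shared storage
    #    (the path pattern is a placeholder for wherever the files actually live).
    arrays = [pickle.load(open(p, 'rb')) for p in glob.glob('/shared/descriptors/*.pkl')]

    # 2. Create the RDD: one element per SIFT descriptor, i.e. a 128-dimensional vector.
    descriptors = np.vstack(arrays)
    rdd = sc.parallelize(descriptors.tolist(), numSlices=64)  # numSlices is arbitrary

    # 3. Cache the RDD, since kMeans iterates over it repeatedly.
    rdd.cache()

    # 4. Train the model; k=100, maxIterations and seed are arbitrary example values.
    model = KMeans.train(rdd, 100, initializationMode='random', maxIterations=10, seed=42)

    # 5. Unpersist the RDD once training is done.
    rdd.unpersist()

    print(model.clusterCenters[0])  # each centroid is a 128-dimensional vector

    Note that this sketch unpickles everything on the driver before parallelizing, which won't scale to a very large image set; in that case something like sc.binaryFiles plus a flatMap that unpickles each file would keep the deserialization on the executors instead.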