Using OpenCV 3.1 I've calculated the SIFT descriptors for a batch of images. Each descriptor has shape (x, 128), and I've used the pickle-based .tofile function to write each descriptor to disk. In a sample of the images, x is between 2000 and 3000.
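For context, here is roughly what my generation step looks like (the image paths and the plain pickle.dump call are illustrative, not my exact code):

import glob
import pickle

import cv2

# SIFT lives in the xfeatures2d contrib module in OpenCV 3.1
sift = cv2.xfeatures2d.SIFT_create()

for path in glob.glob('images/*.jpg'):                    # hypothetical image folder
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(img, None)     # descriptors: (x, 128) float32
    with open(path + '.pkl', 'wb') as f:
        pickle.dump(descriptors, f, protocol=2)           # protocol 2 stays Python 2 compatible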
I'm hoping to make use of Apache Spark's KMeans clustering via pyspark, but my question has two parts. My interest is in what the sequence would look like in Python 2 code, assuming there is some common storage between the descriptor-generation code and the clustering environment.
Is pickling the best way to transfer the descriptor data?
"Best" is very case-specific here. You could try pickle or protobuf.
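As a sketch, reading the pickled arrays back is just the reverse of your write step (the glob pattern is an assumption about how you named the files):

import glob
import pickle

import numpy as np

def load_descriptors(pattern='descriptors/*.pkl'):        # hypothetical naming scheme
    arrays = []
    for path in glob.glob(pattern):
        with open(path, 'rb') as f:
            arrays.append(pickle.load(f))                  # each file -> (x, 128) ndarray
    return np.vstack(arrays)                               # stacked (sum_of_x, 128) matrix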
How do I get from the bunch of pickle files to a cluster-ready dataset?
For example, the LOPQ guys do:
C0 = KMeans.train(first, V, initializationMode='random', maxIterations=10, seed=seed)
where first is the RDD I am referring to, V is the number of clusters, and C0 is the computed cluster model (see line 67 in their code on GitHub).
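As a rough sketch of the whole path from your pickle files to that RDD (the HDFS path, V, and seed values are placeholders, and I'm assuming each file is a straight pickle dump of one (x, 128) array):

import cPickle as pickle                      # Python 2; use plain pickle on Python 3

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName='sift-kmeans')

# Read each pickled file from shared storage as (path, bytes), unpickle it,
# then flatten so every RDD record is a single 128-dim SIFT descriptor.
first = (sc.binaryFiles('hdfs:///descriptors/*.pkl')      # hypothetical location
           .map(lambda kv: pickle.loads(kv[1]))           # bytes -> (x, 128) ndarray
           .flatMap(lambda arr: [row for row in arr]))    # one row per descriptor

V = 1024                                      # number of clusters, illustrative
seed = 42
C0 = KMeans.train(first, V, initializationMode='random',
                  maxIterations=10, seed=seed)

print(C0.clusterCenters[0])                   # the trained model holds the centres

The key point is that KMeans.train just needs an RDD of equal-length vectors; once every record is one 128-dimensional row, the descriptors from all images are clustered together.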