python · k-means · flann · cbir

Bag of Visual Words (obtained from features) for CBIR. Steps?


I'm very confused about the steps to follow to use BOVW for CBIR. I have found a lot of literature about classification, machine learning and SVMs, but that is not quite what I'm looking for:
my problem is searching for similar images in a database, given a query image.

My steps until now:

  1. extract features (example: ORB, BRISK, SIFT...).
  2. store all images' features to disk.
  3. read the features back and run k-means to obtain the centroids (my vocabulary, right?); a rough sketch of steps 1-3 follows below.
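
This is a minimal sketch of what I mean by steps 1-3, assuming ORB features and scikit-learn's MiniBatchKMeans. The image paths and k = 200 are placeholders, and casting the binary ORB descriptors to float so plain Euclidean k-means can cluster them is a simplification on my part:

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Placeholder database; in practice this would be the whole image collection
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]

orb = cv2.ORB_create(nfeatures=500)

per_image_descriptors = {}          # step 2: features kept per image
all_descriptors = []
for path in image_paths:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = orb.detectAndCompute(img, None)
    if desc is None:
        continue
    desc = desc.astype(np.float32)  # binary ORB descriptors cast to float for k-means
    per_image_descriptors[path] = desc
    all_descriptors.append(desc)

# step 3: cluster all descriptors; the k centroids are the visual vocabulary
k = 200
kmeans = MiniBatchKMeans(n_clusters=k, random_state=0)
kmeans.fit(np.vstack(all_descriptors))
vocabulary = kmeans.cluster_centers_    # shape (k, 32): one "visual word" per row
```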

And now I'm stuck. I found many different ways to proceed.

This is my hypothesis:

  4. for each image, assign every descriptor to its nearest centroid (a nearest-neighbour search, e.g. with FLANN?)
  5. build a histogram of those assignments, one histogram per image (see the sketch after this list)
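
This is how I imagine the quantization and histogram step, continuing from the sketch above. The normalization at the end is my own guess, and `kmeans.predict` stands in for the nearest-neighbour search that FLANN could do faster on a large vocabulary:

```python
def bovw_histogram(descriptors, kmeans, k):
    # vector quantization: each descriptor is mapped to the index of its
    # nearest centroid, i.e. each local feature becomes a "visual word"
    words = kmeans.predict(descriptors.astype(np.float32))
    # histogram of visual-word occurrences for this image
    hist, _ = np.histogram(words, bins=np.arange(k + 1))
    hist = hist.astype(np.float32)
    # normalize so images with different numbers of keypoints are comparable
    return hist / (hist.sum() + 1e-7)

# one fixed-length vector per database image -- this is what would get indexed
database_histograms = {path: bovw_histogram(desc, kmeans, k)
                       for path, desc in per_image_descriptors.items()}
```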

Do I also have to extract a dictionary for every single image and then index the images?
Why is vector quantization (steps 4 and 5) necessary?

Can you suggest a possible way to proceed, or any article or tutorial on the topic?

NOTE: For the BOVW implementation I cannot use OpenCV, because it does not work with binary descriptors, so I need to try the scikit-learn library instead.


Solution

  • Ok, this is pretty much what I was looking for:

    https://stackoverflow.com/a/8549874/8894489

    Hope it can be helpful for someone.
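
    For completeness, here is a rough sketch of how the retrieval step itself could look, reusing the objects from the sketches in the question. Cosine distance over the normalized histograms and five neighbours are just my choices, not the only option, and "query.jpg" is a placeholder:

```python
from sklearn.neighbors import NearestNeighbors
import numpy as np

paths = list(database_histograms.keys())
index = np.vstack([database_histograms[p] for p in paths])

# brute-force index over the BoVW histograms of the database
nn = NearestNeighbors(n_neighbors=min(5, len(paths)), metric="cosine")
nn.fit(index)

def query(image_path):
    # same pipeline as for the database: features -> quantization -> histogram
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = orb.detectAndCompute(img, None)
    q_hist = bovw_histogram(desc, kmeans, k).reshape(1, -1)
    distances, indices = nn.kneighbors(q_hist)
    return [(paths[i], d) for i, d in zip(indices[0], distances[0])]

# most similar database images to a (placeholder) query image
print(query("query.jpg"))
```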