Search code examples
algorithmsearchimage-processingsearch-enginesimilarity

Visual similarity search algorithm


I'm trying to build a utility like this http://labs.ideeinc.com/multicolr, but I don't know which algorithm they are using, Does anyone know?


Solution

  • All they are doing is matching histograms.

    So build a histogram for your images. Normalize the histograms by size of image. A histogram is a vector with as many elements as colors. You don't need 32,24, and maybe not even 16 bits of accuracy and this will just slow you down. For performance reasons, I would map the histograms down to 4, 8, and 10-12 bits.

    • Do a fuzzy least distance compare between the all the 4 bit histograms and your sample colors.
    • Then take that set and do the 8 bit histogram compare.
    • Then maybe go up to a 10 or 12 bit histogram compare with the remaining set. This will be the highest performance search, because you are comparing the total set with a very small number of calculations, to find a small subset.
    • Then you work on the small subset with a higher number of calculations, etc.

    The real big trick is to find the best algorithm for matching similar histograms.

    • Start with the distance calculation. In 3 dimensions i think it was:

      SQRT((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)

    I'm doing this from memory, so look it up to make sure.

    • For your purposes, you will have more than 3 dimensions, so you will have more terms. A 4 bit histogram would have 16 terms, an 8 bit one would have 256 terms, etc. Remember that this kind of math is slow, so don't actually do the SQRT part. If you normalize the size of your images small enough, say down to 10,000 pixels, then you know you only will ever have to do x^2 for values 0..10,0000. Pre-calculate a lookup table of x^2 where x goes from 0..10,000. Then your calculations will go fast.

    • When you select a color from the palette, just make a histogram with that color = 10,0000. When select 2, make a histogram with color1=5000, color2=5000 etc.

    • In the end you will have to add in fudge factors to make the application match the real world, but you will find these with testing.