Image hashing in c++, with similar images feature

I'm developing a memoization system for similar images. I will divide this question in two subquestions since one is the next step of the other, but I could post two different questions if you think that's more appropriate.

First thing to know: I don't know nothing about image processing, so please be gentle with this poor guy :)

PROBLEM DESCRIPTION

We have a function ReturnType foo(Image) that takes an image, make some time expensive computation on it and it returns something (which depends on the application). The memoizator that I'm designing is an unordered_map<ImageHash,ReturnType> (or an equivalent structure), so if the user submit the same image twice, it returns directly the already computed ReturnType value.

WHAT I NEED

As you can imagine, I need some HashFunction s.t. HashFunction(Image)=ImageHash, where ImageHash is unique with high probability.

Notice that this ImageHash must not depend on the particular run, machine, time. This because the unordered_map will be serialized on file (so it can be used in different runs) and by shared with other users.

Since performance is really crucial, a fast hash function would be good.

I found this question about the topic, but the author introduces a lot of constraints on the image (and in addiction no good solution is given).

Note: There is no constraint on the image, so you can propose any solution you prefer (include the set of image that it's designed for).

Note: could be SHA-1 a possible solution? I used it only for strings, I don't know if it possible to use it for images (and if it exists a C++ implementation)

THE NEXT STEP

I would like to extend the previous solution so we return the same result for similar images. So formally, given Image image1 similar to Image image2, then the system returns ReturnType result for image1 if (image1,result) OR (image2,result) were already previously computed.

I've heard about phash but I don't know if it's suitable for this purpose.

Solution

i suggest you first start to gather some images before proceeding any further. Having said that, the best way to do it currently is to learn a similarity function using deep learning and map the image into some n-dimensional feature space and use cosine distance to measure similarity. Here's some example code to get you started (https://github.com/kevinlin311tw/caffe-cvprw15). If you want a more performant technique and is willing to follow the rabbit hole, look into triplet ranking loss.

Phash works yes but it's performance in producing a similarity score is far below that of using deep learning features. However, it definitely better than real hashing techniques because just a change of jpeg compression level is going to change the hash value. If you don't want to spend too much time on this, phash would be the best possible alternative as it takes no effort to use.