Tags: python, hamming-distance

Quickest way to find smallest Hamming distance in a list of fixed-length hexes


I'm using ImageHash in Python to generate 64-digit hexadecimal perceptual hashes of around 30,000 images, which I'm storing in a list of dictionaries (the phashes as well as some other image properties). For example:

[{"name":"name1", "phash":"a12a5e81127d890a7c91897edc752b506657233f56c594b7e6575e24e457d465"},
 {"name":"name2", "phash":"a1aa7e011367812a7c9181be9975a9e86657239f3ec09697e6565a24e50bf477"}
 ...
 {"name":"name30000", "phash":"a1aa7e05136f810afc9181ba9951a9686617239f3ec4d497e6765a04e52bfc77"}]

I then have video input from a Raspberry Pi, which is phashed and compared against this database (given the nature of the Pi camera, the test hash from the video stream will never exactly match the hashes in the database). Right now I'm doing a dumb loop, roughly the sketch below, which takes about 5 seconds to check the Hamming distance to each of the ~30,000 pre-calculated hashes; that's far too slow. With the ImageHash library, the Hamming distance between two hashes can be calculated simply as dbHash - testHash. Apparently sorting and bisecting is not the way to approach this, since sort order is unrelated to Hamming distance. So I assume there must be a faster way to get this done? I've read this question regarding metric spaces, but I wanted to check whether anyone knows of a (relatively) simple Python implementation.
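
For reference, the dumb loop looks roughly like this (a minimal sketch: db is the list of dictionaries above, testHash is assumed to be the ImageHash object from the current video frame, and imagehash.hex_to_hash rebuilds an ImageHash object from each stored hex string):

import imagehash

# Brute force: rebuild each stored hash and take its Hamming distance
# to the test hash. ImageHash overloads subtraction, so a - b is the
# number of differing bits.
best_entry, best_dist = None, None
for entry in db:
    dist = imagehash.hex_to_hash(entry["phash"]) - testHash
    if best_dist is None or dist < best_dist:
        best_entry, best_dist = entry, dist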


Solution

  • I got an answer from the guy behind ImageHash, Johannes Buchner.

    I can store the DB as a 2d matrix:

    import numpy

    # Each dbHash here is an ImageHash object; .hash is its 2D boolean
    # array, flattened into one row of a (num_hashes x num_bits) matrix.
    arr = []
    for dbHash in db:
        arr.append(dbHash.hash.flatten())
    arr = numpy.array(arr)
    

    Then I can do the comparison against all of them at once:

    # Broadcast the test hash against every row; True marks a differing bit.
    binarydiff = arr != testhash.hash.reshape((1, -1))
    # Summing the mismatches per row gives the Hamming distance to each DB hash.
    hammingdiff = binarydiff.sum(axis=1)
    # The row with the smallest distance is the closest match.
    closestdbHash_i = numpy.argmin(hammingdiff)
    closestdbHash = db[closestdbHash_i]
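
    Putting the pieces together, here's a self-contained sketch of the whole lookup (names are illustrative; it assumes db is the list of dictionaries from the question and uses imagehash.hex_to_hash to rebuild the boolean hash arrays from the stored hex strings):

    import numpy
    import imagehash

    # Build the (num_hashes x num_bits) boolean matrix once, up front.
    arr = numpy.array([imagehash.hex_to_hash(entry["phash"]).hash.flatten()
                       for entry in db])

    def closest_entry(testhash):
        # Broadcast the test hash against every row; True marks a differing
        # bit, so the per-row sum is the Hamming distance to each stored hash.
        hammingdiff = (arr != testhash.hash.reshape((1, -1))).sum(axis=1)
        return db[numpy.argmin(hammingdiff)]

    One vectorized broadcast like this replaces ~30,000 Python-level comparisons with a single NumPy operation, which is where the speedup comes from.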