Tags: python, django, django-queryset, django-orm, hamming-distance

Custom comparisons with Django (hamming distance)


I have the following code that finds images with an equal hash (identical images). Say I wanted to find images whose Hamming distance is under a certain number instead: can that be incorporated into Django querysets, or done with raw SQL somehow? I don't want to fetch everything and compare in Python, because that is very slow and I have many images.

Current code:

def duplicates(request):
    duplicate_images = []
    images = Image.objects.all()
    for image in images:
        duplicates = Image.objects.filter(hash=image.hash).exclude(pk=image.pk)
        for duplicate in duplicates:
            duplicate_images.append([image, duplicate])
        if len(duplicate_images) > 1000:
            break

Solution

  • Here is how to implement this using a PostgreSQL extension, pg_similarity:

    https://github.com/eulerto/pg_similarity

    Installation:

    $ git clone https://github.com/eulerto/pg_similarity.git
    $ cd pg_similarity
    $ USE_PGXS=1 make
    $ USE_PGXS=1 make install
    $ psql mydb
    psql (9.3.5)
    Type "help" for help.
    
    mydb=# CREATE EXTENSION pg_similarity;
    CREATE EXTENSION
    

    Now you can build a Django queryset with a custom "WHERE" clause that uses the hamming_text function:

    image = Image.objects.get(pk=1252) # the image you want to compare to
    similar = Image.objects.extra(where=['hamming_text(hash,%s)>=0.88'],
                                  params=[image.hash])
    

    And voilà, it works!

    Note: hamming_text returns a normalized similarity rather than a raw distance, so 0 means completely different and 1 means identical.
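
Since the question asks for a maximum Hamming distance while the query above takes a normalized cutoff, here is a minimal sketch of converting one into the other. It rests on assumptions not confirmed in the answer: that the normalized score equals 1 - distance / length, and that hamming_text counts differing character positions in equal-length strings. The helper names (hamming_distance, min_similarity, similar_images) are hypothetical, not part of Django or pg_similarity.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_similarity(max_distance: int, hash_length: int) -> float:
    """Return a similarity cutoff that admits distances 0..max_distance.

    ASSUMPTION: normalized similarity = 1 - distance / hash_length.
    The cutoff sits halfway between the similarity at max_distance and
    the similarity at max_distance + 1, so float rounding on either
    side cannot flip the comparison.
    """
    if not 0 <= max_distance < hash_length:
        raise ValueError("max_distance must be in [0, hash_length)")
    return 1.0 - (max_distance + 0.5) / hash_length


def similar_images(queryset, target_hash: str, max_distance: int):
    """Hypothetical helper: filter a queryset whose model has a text
    column named `hash`, keeping rows within max_distance of target_hash."""
    cutoff = min_similarity(max_distance, len(target_hash))
    # Both values are passed as query parameters, never interpolated.
    return queryset.extra(
        where=["hamming_text(hash, %s) >= %s"],
        params=[target_hash, cutoff],
    )
```

With the Image model from the question, a call might look like similar_images(Image.objects.exclude(pk=image.pk), image.hash, max_distance=5). Under the same assumption, the 0.88 cutoff used above would admit at most 7 differing characters if the hash were 64 characters long. If the hash is stored as hex, each differing character can hide several differing bits, so a per-bit distance would need the hash stored as a bit string.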