Search code examples
pythonhashgenerator

hash functions family generator in python


I am looking for a hash functions family generator that could generate a family of hash functions given a set of parameters. I haven't found any such generator so far. Is there a way to do that with the hashlib package ?

For example I'd like to do something like :

h1 = hash_function(1)
h2 = hash_function(2)
...

and h1 and h2 would be different hash functions.

For those of you who might know about it, I am trying to implement a min-hashing algorithm on a very large dataset.

Basically, I have a very large set of features (100 millions to 1 billion) for a given document, and I need to create 1000 to 10000 different random permutations for this set of features.

I do NOT want to build the random permutations explicitly so the technique I would like to use in the following :

  1. generate a hash function h and consider that for two indices r and s
  2. r appears before s in the permutation if h(r) < h(s) and do that for 100 to 1000 different hash functions.

Are there any known libraries that I might have missed ? Or any standard way of generating families of hash functions with python that you might be aware of ?


Solution

  • I'd just do something like (if you don't need thread-safety -- not hard to alter if you DO need thread safety -- and assuming a 32-bit Python version):

    import random
    
    _memomask = {}
    
    def hash_function(n):
      mask = _memomask.get(n)
      if mask is None:
        random.seed(n)
        mask = _memomask[n] = random.getrandbits(32)
      def myhash(x):
        return hash(x) ^ mask
      return myhash