Tags: python, c, perfect-hash

Create a minimal perfect hash for sparse 64-bit unsigned integer keys


I need a 64-bit to 16-bit perfect hash function for a sparsely populated list of keys.

I have a dictionary in Python with 48,326 keys, each a 64-bit integer. I would like to create a minimal perfect hash for this list of keys. (I don't want to wait a few days to calculate the MPH, so I am OK with it mapping to a 16-bit hash instead.)

The objective is to eventually port this dictionary to C as an array that contains the dict values, with the index calculated by the minimal perfect hash function taking the key as input. I cannot use external hashing libraries in the C port of the application I am building.

Question: Is there any Python library which will take my keys as input and give me the hashing parameters (based on a defined hashing algorithm) as output?

I found the library perfection 2.0.0, but since my keys are 64-bit, it just hung (even when I tested it on a subset of 2000 keys).

EDIT: As suggested in the comments, I looked at Steve Hanov's algorithm and modified the hash function to take a 64-bit integer (changing the values of the FNV prime and offset as per this wiki page).

While I got a result, unfortunately the maps return negative index values. I can make that work, but it means adding another 4 cycles to the hash calculation to check for a negative index, which I would like to avoid.
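
For reference, here is a minimal sketch of the kind of 64-bit FNV variant I mean, using the 64-bit offset basis and prime from the wiki page (an illustration only, not my exact code):

    # Sketch: FNV-1a over the 8 bytes of a 64-bit key, using the 64-bit
    # offset basis and prime from the FNV wiki page, reduced modulo the table size.
    FNV64_OFFSET = 0xcbf29ce484222325
    FNV64_PRIME = 0x100000001b3

    def fnv1a_64(key, size, d=FNV64_OFFSET):
        for byte in key.to_bytes(8, 'big'):
            d = ((d ^ byte) * FNV64_PRIME) & 0xffffffffffffffff
        return d % size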


Solution

  • Personally, I'd just generate a table with gperf, or for a large number of keys, with CMPH, and be done with it.

    If you must do this in Python, then I found this blog post with some Python 2 code that is very efficient at turning string keys into a minimal perfect hash, using an intermediary table.

    Adapting the code in the post to your requirements produces a minimal perfect hash for 50k items in under 0.35 seconds:

    >>> import random
    >>> testdata = {random.randrange(2**64): random.randrange(2**64)
    ...             for __ in range(50000)}  # 50k random 64-bit keys
    >>> import timeit
    >>> timeit.timeit('gen_minimal_perfect_hash(testdata)', 'from __main__ import gen_minimal_perfect_hash, testdata', number=10)
    3.461486832005903
    

    The changes I made:

    • I switched to Python 3, followed the Python style guide and made the code more Pythonic
    • I am turning 64-bit unsigned integers into 8-byte bytestrings with int.to_bytes()
    • Instead of storing negative numbers, I am using a flag to distinguish between direct and hash-input values in the intermediary table

    The adapted code:

    # Easy Perfect Minimal Hashing
    # By Steve Hanov. Released to the public domain.
    # Adapted to Python 3 best practices and 64-bit integer keys by Martijn Pieters
    #
    # Based on:
    # Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud,
    # "Practical minimal perfect hash functions for large databases", CACM, 35(1):105-121
    # also a good reference:
    # Compress, Hash, and Displace algorithm by Djamal Belazzougui,
    # Fabiano C. Botelho, and Martin Dietzfelbinger
    from itertools import count, groupby
    
    
    def fnv_hash_int(value, size, d=0x811c9dc5):
        """Calculates a distinct hash function for a given 64-bit integer.
    
        Each value of the integer d results in a different hash value. The return
        value is the modulus of the hash and size.
    
        """
        # Use the FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/
        # The unsigned integer is first converted to an 8-byte big-endian byte string.
        for c in value.to_bytes(8, 'big'):
            d = ((d * 0x01000193) ^ c) & 0xffffffff
    
        return d % size
    
    
    def gen_minimal_perfect_hash(dictionary, _hash_func=fnv_hash_int):
        """Computes a minimal perfect hash table using the given Python dictionary.
    
        It returns a tuple (intermediate, values). intermediate and values are both
        lists. intermediate contains the intermediate table of indices needed to
        compute the index of the value in values; a tuple of (flag, d) is stored, where
        d is either a direct index, or the input for another call to the hash function.
        values contains the values of the dictionary.
    
        """
        size = len(dictionary)
    
        # Step 1: Place all of the keys into buckets
        buckets = [[] for __ in dictionary]
        intermediate = [(False, 0)] * size
        values = [None] * size
    
        for key in dictionary:
            buckets[_hash_func(key, size)].append(key)
    
        # Step 2: Sort the buckets and process the ones with the most items first.
        buckets.sort(key=len, reverse=True)
        # Only look at buckets of length greater than 1 first; partitioned produces
        # groups of buckets of lengths > 1, then those of length 1, then the empty
        # buckets (we ignore the last group).
        partitioned = (g for k, g in groupby(buckets, key=lambda b: len(b) != 1))
        for bucket in next(partitioned, ()):
            # Try increasing values of d until we find a hash function
            # that places all items in this bucket into free slots
            for d in count(1):
                slots = {}
                for key in bucket:
                    slot = _hash_func(key, size, d=d)
                    if values[slot] is not None or slot in slots:
                        break
                    slots[slot] = dictionary[key]
                else:
                    # all slots filled, update the values table; False indicates
                    # these values are inputs into the hash function
                    intermediate[_hash_func(bucket[0], size)] = (False, d)
                    for slot, value in slots.items():
                        values[slot] = value
                    break
    
        # The next group is buckets with only 1 item. Process them more quickly by
        # directly placing them into a free slot.
        freelist = (i for i, value in enumerate(values) if value is None)
    
        for bucket, slot in zip(next(partitioned, ()), freelist):
            # These are 'direct' slot references
            intermediate[_hash_func(bucket[0], size)] = (True, slot)
            values[slot] = dictionary[bucket[0]]
    
        return (intermediate, values)
    
    
    def perfect_hash_lookup(key, intermediate, values, _hash_func=fnv_hash_int):
        "Look up a value in the hash table defined by intermediate and values"
        direct, d = intermediate[_hash_func(key, len(intermediate))]
        return values[d if direct else _hash_func(key, len(values), d=d)]
    

    The above produces two lists with 50k entries each; the values in the first table are (boolean, integer) tuples with the integers in the range [0, tablesize) (in theory the values could range up to 2^16 but I'd be very surprised if it ever took 65k+ attempts to find a slot arrangement for your data). Your table size is < 50k, so the above arrangement makes it possible to store the entries in this list in 4 bytes (bool and short make 3, but alignment rules add one byte padding) when expressing this as a C array.
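
    As a quick sanity check of that 4-byte claim, here is a small sketch (not part of the lookup itself) that packs the intermediate table with Python's struct module, reusing the gen_minimal_perfect_hash and testdata defined above:

    import struct

    tables = gen_minimal_perfect_hash(testdata)
    # Each entry: 1-byte bool flag, 1 explicit pad byte, 2-byte unsigned short d,
    # i.e. 4 bytes, mirroring the typical layout of the C struct shown further below.
    assert all(d < 2 ** 16 for _, d in tables[0])
    packed = b''.join(struct.pack('<?xH', flag, d) for flag, d in tables[0])
    assert len(packed) == 4 * len(tables[0])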

    A quick test to see if the hash tables are correct and produce the right output again:

    >>> tables = gen_minimal_perfect_hash(testdata)
    >>> for key, value in testdata.items():
    ...     assert perfect_hash_lookup(key, *tables) == value
    ...
    

    You only need to implement the lookup function in C:

    • The fnv_hash_int operation can take a pointer to your 64-bit integer, then just cast that pointer to an array of 8-bit values and increment an index 8 times to access each separate byte; use a suitable function to ensure big-endian (network) order.
    • You don't need to mask with 0xffffffff in C; arithmetic on an unsigned 32-bit integer wraps around, so the overflowing bits are discarded automatically.
    • len(intermediate) == len(values) == len(dictionary) and can be captured in a constant.
    • Assuming C99, store the intermediate table as an array of a struct type with flag being a bool and d an unsigned short; that's just 3 bytes of data plus 1 byte of padding to keep d aligned, so 4 bytes per entry. The datatype in the values array depends on the values in your input dictionary.

    If you forgive my C skills, here's a sample implementation:

    mph_table.h

    #include "mph_generated_table.h"
    #include <arpa/inet.h>
    #include <stdint.h>
    
    #ifndef htonll
    // see https://stackoverflow.com/q/3022552
    #define htonll(x) ((1==htonl(1)) ? (x) : ((uint64_t)htonl((x) & 0xFFFFFFFF) << 32) | htonl((x) >> 32))
    #endif
    
    uint64_t mph_lookup(uint64_t key);
    

    mph_table.c

    #include "mph_table.h"
    #include <stdbool.h>
    #include <stdint.h>
    
    #define FNV_OFFSET 0x811c9dc5
    #define FNV_PRIME 0x01000193
    
    uint32_t fnv_hash_modulo_table(uint32_t d, uint64_t key) {
        d = (d == 0) ? FNV_OFFSET : d;
        uint8_t* keybytes = (uint8_t*)&key;
        for (int i = 0; i < 8; ++i) {
            d = (d * FNV_PRIME) ^ keybytes[i];
        }
        return d % TABLE_SIZE;
    }
    
    uint64_t mph_lookup(uint64_t key) {
        _intermediate_entry entry = 
            mph_tables.intermediate[fnv_hash_modulo_table(0, htonll(key))];
        return mph_tables.values[
            entry.flag ?
                entry.d : 
                fnv_hash_modulo_table((uint32_t)entry.d, htonll(key))];
    }
    

    which would rely on a generated header file, produced from:

    from textwrap import indent
    
    template = """\
    #include <stdbool.h>
    #include <stdint.h>
    
    #define TABLE_SIZE %(size)s
    
    typedef struct _intermediate_entry {
        bool flag;
        uint16_t d;
    } _intermediate_entry;
    typedef struct mph_tables_t {
        _intermediate_entry intermediate[TABLE_SIZE];
        uint64_t values[TABLE_SIZE];
    } mph_tables_t;
    
    static const mph_tables_t mph_tables = {
        {  // intermediate
    %(intermediate)s
        },
        {  // values
    %(values)s
        }
    };
    """
    
    tables = gen_minimal_perfect_hash(dictionary)
    size = len(dictionary)
    cbool = ['false, ', 'true,  ']
    def perline(entries, n=10):
        # group entries into lines of at most n, keeping any final partial line
        entries = list(entries)
        return (entries[i:i + n] for i in range(0, len(entries), n))

    entries = (f'{{{cbool[flag]}{d:#06x}}}' for flag, d in tables[0])
    intermediate = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)
    entries = (format(v, '#018x') for v in tables[1])
    values = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)
    
    with open('mph_generated_table.h', 'w') as generated:
        generated.write(template % locals())
    

    where dictionary is your input table.

    Compiled with gcc -O3, the hash function is inlined (loop unrolled) and the whole mph_lookup function clocks in at 300 CPU instructions. A quick benchmark looping through all 50k random keys I generated shows that my 2.9 GHz Intel Core i7 laptop can do 50 million lookups per second for those keys (0.02 microseconds per key).
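
    If you want a rough point of comparison with the pure-Python lookup, a quick timeit sketch (reusing the testdata and tables from the session above; this times the Python code, not the compiled C version):

    import timeit

    keys = list(testdata)
    # ten passes over all 50k keys through the pure-Python perfect_hash_lookup
    total = timeit.timeit(
        lambda: [perfect_hash_lookup(k, *tables) for k in keys], number=10)
    print(f'{total / (10 * len(keys)) * 1e6:.3f} microseconds per Python lookup')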