I'm following an article that uses a hash table with a fixed number of 2048 buckets.
The hash function takes a pointer and the hash table itself, treats the address as a bit pattern, shifts it right by three bits, and reduces it modulo the size of the hash table (2048). It's written as a macro in this case:
#define hash(p, t) (((unsigned long)(p) >> 3) & \
(sizeof(t) / sizeof((t)[0]) - 1))
The article, however, doesn't explain why it right-shifts the address by three bits, and the choice seems a bit arbitrary at first. My first guess was that cutting off the last three bits groups pointers with similar addresses, but I don't see how this would be useful, given that most addresses allocated for one application are similar anyway. Take this as an example:
#include <stdio.h>

int main(void)
{
    int i1 = 0, i2 = 0, i3 = 0;

    /* %p expects a void *, so cast explicitly. */
    printf("%p\n", (void *)&i1);
    printf("%p\n", (void *)&i2);
    printf("%p\n", (void *)&i3);

    /* Provided that the size of the hash table is 2048. */
    printf("%lu\n", ((unsigned long)(&i1) >> 3) & 2047);
    printf("%lu\n", ((unsigned long)(&i2) >> 3) & 2047);
    printf("%lu\n", ((unsigned long)(&i3) >> 3) & 2047);
    return 0;
}
Also, I'm wondering why it's choosing 2048 as a fixed size and if this is in relation to the three-bit shift.
For reference, the article is an extract from "C Interfaces and Implementations: Techniques for Creating Reusable Software" by David R. Hanson.
Memory allocations must be properly aligned. For example, the hardware may specify that an int should be aligned to a 4-byte boundary, or that a double should be aligned to 8 bytes. This means that the last two address bits of an int must be zero, and the last three bits of a double.
Now, C allows you to define complex structures which mix char, int, long, float, and double fields (and more). And while the compiler can add padding to align the offsets to the fields to the appropriate boundaries, the entire structure must also be properly aligned to the largest alignment that one of its members uses.
malloc() does not know what you are going to do with the memory, so it must return an allocation that's aligned for the worst case. This alignment is specific to the platform, but it's generally not less than 8-byte alignment. A more typical value today is 16-byte alignment.
So the hash function simply discards the three low bits of the address, which are virtually always zero and therefore contribute nothing to the hash value. Without the shift, 8-byte-aligned addresses could only ever land in every eighth bucket (256 of the 2048), so the shift easily reduces the number of hash collisions by a factor of 8. (The fact that it only cuts off three bits suggests that the function was written a while ago. On a platform with 16-byte allocation alignment, it should cut off four bits.)