I am reading about Universal hashing on integers. The prerequisite and mandatory precondition seems to be that we choose a prime number p
greater than the set of all possible keys.
I am not clear on this point.
If our set of keys are of type int
then this means that the prime number needs to be of the next bigger data type e.g. long
.
But eventually whatever we get as the hash would need to be down-casted to an int to index the hash table. Doesn't this down-casting affect the quality of the Universal Hashing (I am referring to the distribution of the keys over the buckets) somehow?
If our set of keys are integers then this means that the prime number needs to be of the next bigger data type e.g. long.
That is not a problem. Sometimes it is necessary otherwise the hash family cannot be universal. See below for more information.
But eventually whatever we get as the hash would need to be down-casted to an
int
to index the hash table.Doesn't this down-casting affect the quality of the Universal Hashing (I am referring to the distribution of the keys over the buckets) somehow?
The answer is no. I will try to explain.
Whether p
has another data type or not is not important for the hash family to be universal. Important is that p
is equal or larger than u
(the maximum integer of the universe of integers). It is important that p
is big enough (i.e. >= u
).
A hash family is universal when the collision probability is equal or smaller than
1/m
.
So the idea is to hold that constraint.
The value of p
, in theory, can be as big as a long
or more. It just needs to be an integer and prime.
u
is the size of the domain/universe (or the number of keys). Given the universe U = {0, ..., u-1}
, u
denotes the size |U|
.m
is the number of bins or bucketsp
is a prime which must be equal or greater than n
H = {h(a,b)(x)}
with h(a,b)(x) = ((a * x + b) mod p) mod m
. Note that a
and b
are randomly chosen integers (from all possible integers, so theoretically can be larger than p
) modulo a prime p
(which can make them either smaller or larger than m
, the number of bins/buckets); but here too the data type (domain of values does not matter). See Hashing integers on Wikipedia for notation._p/m_ * 1/(p-1)
(the underscores mean to truncate the decimals). For p >> m
(p
considerably bigger than m
) the probability tends to 1/m
(but this does not mean that the probability would be better the larger p
is).In other terms answering your question: p
being a bigger data type is not a problem here and can be even required. p
has to be equal or greater than u
and a
and b
have to be randomly chosen integers modulo p
, no matter the number of buckets m
. With these constraints you can construct a universal hash family.
unsigned char
(in C for example). Then U = {0, ..., 255}
Let p
be (next possible) prime equal or greater than 256
. Note that p
can be any of these types (short
, int
, long
be it signed or unsigned). The point is that the data type does not play a role (In programming the type mainly denotes a domain of values.). Whether 257
is short
, int
or long
doesn't really matter here for the sake of correctness of the mathematical proof. Also we could have chosen a larger p
(i.e. a bigger data type); this does not change the proof's correctness.
257
. 25
buckets, i.e. m = 25
. This means a hash family would be universal if the collision probability is equal or less than 1/25
, i.e. approximately 0.04
._p/m_ * 1/(p-1)
: _257/25_ * 1/256 = 10/256 = 0.0390625
which is smaller than 0.04
. It is a universal hash family with the chosen parameters.We could have chosen m = u = 256
buckets. Then we would have a collision probability of 0.003891050584
, which is smaller than 1/256 = 0,00390625
. Hash family is still universal.
Let's try with m
being bigger than p
, e.g. m = 300
. Collision probability is 0, which is smaller than 1/300 ~= 0.003333333333
. Trivial, we had more buckets than keys. Still universal, no collisions.
We have the following:
x
of type int
(an element of |U|
)a
, b
, p
of type long
m
we'll see later in the example
p
so that it is bigger than the max u
(element of |U|
), p
is of type long
.a
and b
(modulo p
) randomly. They are of type long
, but always < p
.x
(of type int
from U
) calculate ((a*x+b) mod p)
. a*x
is of type long
, (a*x+b)
is also of type long
and so ((a*x+b) mod p
is also of type long
. Note that ((a*x+b) mod p)
's result is < p
. Let's denote that result h_a_b(x)
.h_a_b(x)
is now taken modulo m
, which means that at this step it depends on the data type of m
whether there will be downcasting or not. However, it does not really matter. h_a_b(x)
is < m
, because we take it modulo m
. Hence the value of h_a_b(x) modulo m
fits into m
's data type. In case it has to be downcasted there won't be a loss of value. And so you have mapped a key to a bin/bucket.