I wanna create unique random numbers with size of m
and range of 0 to n-1
. I think there are 2 traditional ways to do this as below.
val m = 100
val n = 1000
val set = mutableSetOf<Int>()
while (set.size < m) {
val v = Random.nextInt(n)
set.add(v)
}
val m = 100
val n = 1000
val set = (0..n).shuffled().subList(0, m).toSet()
Both are good ways to get unique random numbers, but they need some storage to check if numbers are unique or not.
However, I need to generate unique random numbers without any space, because the m
or n
would be quite big(e.g. a billion). And note that the generated numbers do not have to be same every time the code is executed.(like Random
with seed). Is there any algorithm for this ?
First observation. An MD5 hash is a 128 bit number. Dividing 2^128
by this therefore gives a number in the range 0 to 1.
If we use an MD5 hash as a random number, use that to make a decision
, and then MD5 hash decision + hash
, we'll get another random number. Therefore you can write a program that makes a set of decisions off of random information. Any time you make the same sequence of decisions, the same random numbers will come up. But once any decision changes, the random numbers will change thereafter.
Using this we can write a function allowing us to make a random choice from a binomial distribution:
count = binomial_choices_before(seed, n, m, k)
This function will answer the question, "We chose m
distinct numbers out of the range 0..n-1
. How many are before k
?" Internally it will need to make a series of decisions, replacing the seed
, but the caller will still have the original seed
. I will leave details of that function to the reader. (You probably want to use a normal approximation to get into a range, then more exact calculations from there. But, play around with it. It is enough for me that it is doable.)
So how is the overall random choice going to work? We will start with the following information:
seed = random information
n = range 0..n-1
m = how many will be chosen
i = we're choosing the i'th of m
First we use binomial_choices_before
to figure out how many of those m
are in the range 0..(n//2 - 1)
. (I'm using //
as notation for integer division.) Call that m1
.
Use that fact as a decision to modify the seed
and generate another random number.
Next we will figure out which side we will go on, and how many went on that side before us. For now, assume we can, and I'll give details of how later. We will NOT use this decision process to modify the seed
.
We've now figured out that our number is in a smaller range of size around n/2
. We know how many other choices are in that range, and it is probably around m/2
. We know where we are among those choices, and it is probably around i/2
. We can proceed recursively until we are in a range with one thing. We then make a final random choice of which one, and that is our answer.
Now let's explain how we figure out which choice we make, and where we are in the list of choices that we made. (Those two are intertwined.)
We'll do this in a separate function. So we can mangle the seed
all we want, and not affect the larger decision flow.
First we modify the seed
with the knowledge that we're trying to figure out how many of the m1
that went on the first side do so out of the first m//2
. (So this is a different random number than we produce elsewhere in the decision process.)
Next, we produce that answer. Again, we're doing it using binomial_choices_before
.
Now we know if i
is in the first half of the choices, or the second. If in the first half of choices, we have put i
in a smaller population of choices. If in the second half of choices, we know how many of the first m//2
went on each side, and we STILL have put i
in a smaller population of choices.
We then recursively go into our half of the choices until we figure out how many of the first i-1
choices went into the first range of numbers, and also how many of the first i
did. If those are the same, we went into the second range, and we know how many did before. If they are different, we went into the first range, and we know how many did before. Either way we have our answer.
NOTE Suppose that we run this twice, for different numbers i
and j
. For as far in the decision tree as i
and j
remain in the same group, they will make all the same decisions, generate all the same seed
values, and come to all the same conclusions. Once we split them into separate ranges, each will modify its seed, and then on make uncorrelated decisions that cannot come into conflict.
The result is an entirely deterministic, but unpredictable, function that takes seed, n, m, i
and uses seed
to produce the i
th number in the choice of m
out of n
in such a way that the numbers will be produced in random order, and the same number will not be produced twice.
Why does this work? We have random numbers that are tied to a chain of decisions. Any two numbers that will be output close to each other will make a sequence of the same decisions, generating the same random numbers, until they are in separate ranges. And any two inputs that are close to each other whose outputs are also close to each other will make the same decisions about where they are within the range of inputs going to this output range, until their input or output ranges are separated. Therefore we can guarantee that if a particular input went to a particular output, no other input went to that output.
Unfortunately this procedure requires a ton of MD5 hashing. So it will be slow. But it is embarrassingly parallel.
If you stick an LRU cache in front of a lot of the logic, you can skip a lot of the repeated logic to speed it up. How much it speeds up depends on how much caching you do. But it still will be slow. :-(