Search code examples
pythonmemorysamplinguniform

Sampling from a huge uniform distribution in Python


I need to select 3.7*10^8 unique values from the range [0, 3*10^9] and either obtain them in order or keep them in memory.

To do this, I started working on a simple algorithm where I sample smaller uniform distributions (that fit in memory) in order to indirectly sample the large distribution that really interests me.

The code is available at the following gist https://gist.github.com/legaultmarc/7290ac4bef4edb591d1e

Since I'm having trouble implementing something more robust, I was wondering if you had other ideas to sample unique values from a large discrete uniform. I'm looking for either an algorithm, a module or an idea on how to manage very large lists directly (perhaps using the hard drive instead of memory).


Solution

  • There is an interesting post, Generating sorted random ints without the sort? O(n) which suggests that instead of generating uniform random ints, you can do a running-sum on exponential random deltas, which gives you a uniform random result generated in sorted order.

    It's not guaranteed to give exactly the number of samples you want, but should be pretty close, and much faster / lower memory requirements.

    Edit: I found a second post, generating sorted random numbers without exponentiation involved? which suggests tweaking the distribution density as you go to generate an exact number of samples, but I am leery of just exactly what this would do to your "uniform" distribution.

    Edit2: Another possibility that occurs to me would be to use an inverse cumulative binomial distribution to iteratively split your sample range (predict how many uniformly generated random samples would fall in the lower half of the range, then the remainder must be in the upper half) until the block-size reaches something you can easily hold in memory.