Tags: python, random, numpy, python-itertools

Memory-efficient random number iterator without replacement


I feel like this one should be easy, but after numerous searches and attempts I can't figure out an answer. Basically I have a very large number of items that I want to sample in a random order without replacement. In this case they are cells in a 2D array. The solution I would use for a smaller array doesn't translate because it requires shuffling an in-memory array. If the number I had to sample was small, I could also just randomly sample items and keep a list of the values I'd tried. Unfortunately I often have to sample a very large proportion of all the cells, sometimes all of them.

What I'd like to create is an iterator, using some combination of itertools, numpy and/or random, that yields the next random cell (x and y indices). Another possible solution would be an iterator that yields the next random number (without replacement) between 0 and (x_count * y_count), which I could map back to a cell location. Neither seems easily accomplished.
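For concreteness, here is the kind of flat-index mapping I have in mind, on a toy grid (the dimensions are tiny placeholders, and `random.sample` here just stands in for whatever the actual without-replacement source would be — whether it is memory efficient enough at my real sizes is exactly the question):

```python
import random

# toy dimensions purely for illustration
x_count, y_count = 4, 3

def random_cells(n):
    """Yield n distinct random cells by drawing flat indices
    without replacement and mapping each back to (x, y)."""
    for i in random.sample(range(x_count * y_count), n):
        # row-major layout: i = x * y_count + y
        x, y = divmod(i, y_count)
        yield (x, y)

cells = list(random_cells(5))
```

Every yielded pair is unique and stays inside the grid, which is the behavior I'm after.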

Thanks for any suggestions!

Here's my current solution.

import numpy as np
import itertools as itr
import random as rdm

# works great
x_count = 10
y_count = 5

# good luck!
# x_count = 10000
# y_count = 20000

x_indices = np.arange(x_count)
y_indices = np.arange(y_count)

# materializes all x_count * y_count pairs in memory before shuffling
cell_indices = itr.product(x_indices, y_indices)
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)

for i in range(25):
    print(list_cell_indices[i])

So, based on the current responses and my attempt to translate a Perl answer (a language I know nothing about), my understanding is that the best I can do is the following:

import random as rdm

x_count = 10000
y_count = 5000

sample_count = 10000
keep_probability = 0.01


tried_cells = set()
kept_cells = set()

while len(kept_cells) < sample_count:
    # randrange is half-open, so indices stay inside the grid
    x = rdm.randrange(x_count)
    y = rdm.randrange(y_count)

    if (x, y) not in tried_cells:
        tried_cells.add((x, y))
        if rdm.random() < keep_probability:
            kept_cells.add((x, y))


print("worked")

In most cases the processing time and memory used aren't that bad. Maybe I could compare sample_count against the number of cells expected to pass keep_probability and throw an error for difficult cases.
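Something like this rough guard is what I have in mind (the function name and the 0.5 headroom factor are arbitrary choices of mine, not anything standard):

```python
def check_feasibility(x_count, y_count, sample_count, keep_probability):
    """Rough guard: rejection sampling degrades badly once the requested
    sample approaches the number of cells expected to pass the keep test."""
    total_cells = x_count * y_count
    expected_keepers = total_cells * keep_probability
    if sample_count > 0.5 * expected_keepers:  # 0.5 threshold is arbitrary
        raise ValueError(
            'sample_count %d is close to the ~%.0f cells expected to pass '
            'keep_probability=%.3f; rejection sampling will thrash'
            % (sample_count, expected_keepers, keep_probability))

# my case above: 5e7 cells, ~5e5 expected keepers, so 10000 is fine
check_feasibility(10000, 5000, 10000, 0.01)
```

For the parameters above this passes silently; asking for more samples than the expected keeper count could ever supply would raise instead.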


Solution

  • How about this approach? First I create the x*y array and reshape it to 2-D. Then, since each cell can be uniquely identified by a single integer, I draw a sample from 0 to (x*y).

    import numpy
    
    x_count = 10000
    y_count = 20000
    
    large_table = numpy.arange(y_count * x_count).reshape(y_count, x_count)
    print(large_table)
    
    def get_random_item(sample_size):
        from random import sample
        for i in sample(range(y_count * x_count), sample_size):
            # note: x here ends up indexing the y_count rows and y the
            # x_count columns, so large_table[x][y] is a valid cell
            y, x = divmod(i, y_count)
            yield (x, y)
    
    for x, y in get_random_item(10):
        print('%12i   x: %5i y: %5i' % (large_table[x][y], x, y))
    

    Which returns:

    (first, to simulate the 2-D array you created via product)

    [[        0         1         2 ...,      9997      9998      9999]
     [    10000     10001     10002 ...,     19997     19998     19999]
     [    20000     20001     20002 ...,     29997     29998     29999]
     ..., 
     [199970000 199970001 199970002 ..., 199979997 199979998 199979999]
     [199980000 199980001 199980002 ..., 199989997 199989998 199989999]
     [199990000 199990001 199990002 ..., 199999997 199999998 199999999]]
    

    Then it returns the 2-D coordinates, which can be translated into your cell contents simply via array[x][y]:

       154080675   x: 15408 y:   675
       186978188   x: 18697 y:  8188
       157506087   x: 15750 y:  6087
       168859259   x: 16885 y:  9259
        29775768   x:  2977 y:  5768
        94167866   x:  9416 y:  7866
        15978144   x:  1597 y:  8144
        91964007   x:  9196 y:  4007
       163462830   x: 16346 y:  2830
        62613129   x:  6261 y:  3129
    

    sample() is documented as 'Used for random sampling without replacement', and this approach adheres to the advice found on the Python random page: 'This is especially fast and space efficient for sampling from a large population: sample(range(10000000), k=60).'

    I note that while I use get_random_item() as a generator, the underlying sample() still returns all of its picks at once as a list. However, it does not materialize the range it samples from, so memory use is proportional to sample_size rather than to y*x, and it runs quite swiftly.
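    As a quick sanity check that the divmod mapping really is without replacement, here is the same generator shrunk to a toy grid (dimensions are arbitrary stand-ins for the 10000 x 20000 case above); consuming the entire population must visit every cell exactly once:

    ```python
    from random import sample

    # toy dimensions standing in for the large grid above
    x_count, y_count = 7, 5

    def get_random_item(sample_size):
        # same index mapping as the full-size version above
        for i in sample(range(y_count * x_count), sample_size):
            y, x = divmod(i, y_count)
            yield (x, y)

    # sampling the whole population should yield each cell exactly once
    cells = list(get_random_item(x_count * y_count))
    assert len(cells) == len(set(cells)) == x_count * y_count
    ```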