I feel like this one should be easy, but after numerous searches and attempts I can't figure out an answer. Basically I have a very large number of items that I want to sample in a random order without replacement. In this case they are cells in a 2D array. The solution I would use for a smaller array doesn't translate, because it requires shuffling an in-memory array. If the number I had to sample was small, I could also just randomly sample items and keep a list of the values I'd already tried. Unfortunately I often have to sample a very large proportion of all the cells, sometimes all of them.
What I'd like to create is an iterator, using some combination of itertools, numpy, and/or random, that yields the next random cell (x and y indices). Another possible solution would be an iterator that yields the next random number (without replacement) between 0 and (x_count * y_count), which I could map back to a cell location. Neither seems easily accomplished.
Thanks for any suggestions!
Here's my current solution.
import numpy as np
import itertools as itr
import random as rdm

# works great
x_count = 10
y_count = 5

# good luck!
#x_count = 10000
#y_count = 20000

x_indices = np.arange(x_count)
y_indices = np.arange(y_count)

# build every (x, y) pair, materialize the whole list, then shuffle it
cell_indices = itr.product(x_indices, y_indices)
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)

for i in range(25):
    print list_cell_indices[i]
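For the small case, the same shuffle can be done over flat cell indices instead of a Python list of tuples, which is far more compact for large grids. A Python 3 / NumPy sketch (the helper name `cells` is mine, not from the question):

```python
import numpy as np

x_count, y_count = 10, 5

# Shuffle the flat indices 0 .. x_count*y_count - 1 in a single NumPy
# array rather than a Python list of (x, y) tuples.
flat = np.random.permutation(x_count * y_count)

def cells(flat_indices, y_count):
    """Map each flat index back to an (x, y) cell via divmod."""
    for i in flat_indices:
        x, y = divmod(int(i), y_count)
        yield (x, y)

first = list(cells(flat[:25], y_count))
print(first)
```

This still holds the full permutation in memory, but as one integer array rather than millions of tuple objects.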
So, based on the current responses and my attempt to translate Perl (which I know nothing about), I'm understanding that the best I can do is the following:
import random as rdm

x_count = 10000
y_count = 5000
sample_count = 10000
keep_probability = 0.01

tried_cells = set()
kept_cells = set()

while len(kept_cells) < sample_count:
    # randrange(n) yields 0 .. n-1; randint(0, n) would include n
    # itself, one past the last valid index
    x = rdm.randrange(x_count)
    y = rdm.randrange(y_count)
    if (x, y) not in tried_cells:
        tried_cells.add((x, y))
        if rdm.random() < keep_probability:
            kept_cells.add((x, y))

print "worked"
In most cases the processing time and memory used aren't that bad. Maybe I could check keep_probability and sample_count against the grid size and raise an error for difficult cases.
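That check could look something like this (a sketch only; `check_feasible` and the 0.5 cutoff are my own names and choices, not anything from the standard library):

```python
def check_feasible(x_count, y_count, sample_count, keep_probability,
                   load_limit=0.5):
    """Refuse up front when rejection sampling would likely thrash.

    On average about sample_count / keep_probability random draws are
    needed; if that is a large fraction of the whole grid, repeated
    collisions with tried_cells start to dominate.  load_limit = 0.5
    is an arbitrary cutoff chosen for illustration.
    """
    total_cells = x_count * y_count
    expected_draws = sample_count / keep_probability
    if expected_draws > load_limit * total_cells:
        raise ValueError("expected ~%d draws for a grid of %d cells"
                         % (expected_draws, total_cells))

check_feasible(10000, 5000, 10000, 0.01)  # ~1e6 draws vs 5e7 cells: fine
```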
How about this approach? First create the x*y array and reshape it to 2-D. Then, knowing each cell can be uniquely identified by a single integer, draw a sample from 0 to (x*y).
import numpy
from random import sample

x_count = 10000
y_count = 20000

large_table = numpy.arange(y_count * x_count).reshape(y_count, x_count)
print large_table

def get_random_item(sample_size):
    # sample() picks without replacement; divmod() splits each flat
    # index into a pair addressing one unique cell of the table
    for i in sample(xrange(y_count * x_count), sample_size):
        y, x = divmod(i, y_count)
        yield (x, y)

for x, y in get_random_item(10):
    print '%12i x: %5i y: %5i' % (large_table[x][y], x, y)
Which returns (first, to simulate the existing 2-D array you created via product):
[[ 0 1 2 ..., 9997 9998 9999]
[ 10000 10001 10002 ..., 19997 19998 19999]
[ 20000 20001 20002 ..., 29997 29998 29999]
...,
[199970000 199970001 199970002 ..., 199979997 199979998 199979999]
[199980000 199980001 199980002 ..., 199989997 199989998 199989999]
[199990000 199990001 199990002 ..., 199999997 199999998 199999999]]
Then it yields the 2-D coordinates, which can be translated into your cell contents simply via large_table[x][y]:
154080675 x: 15408 y: 675
186978188 x: 18697 y: 8188
157506087 x: 15750 y: 6087
168859259 x: 16885 y: 9259
29775768 x: 2977 y: 5768
94167866 x: 9416 y: 7866
15978144 x: 1597 y: 8144
91964007 x: 9196 y: 4007
163462830 x: 16346 y: 2830
62613129 x: 6261 y: 3129
sample() is documented as 'Used for random sampling without replacement', and this approach follows the advice on the Python random page: 'This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).'
I note that while I use get_random_item() as a generator, sample() itself returns a full list of sample_size results up front, so sampling costs O(sample_size) memory; the y*x memory in this demo comes from large_table, which you only need if you want to look up cell contents. It runs quite swiftly all the same.
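If memory is the concern, the same idea works without building large_table at all. A Python 3 sketch (range replaces xrange, and divmod(i, x_count) maps flat index i straight to its (row, col) in a table of shape (y_count, x_count); the name `random_cells` is mine):

```python
from random import sample

x_count, y_count = 10000, 20000

def random_cells(sample_size):
    # range() here (xrange in Python 2) is never materialized:
    # sample() indexes into it lazily, so sampling needs only
    # O(sample_size) memory even though the population has
    # x_count * y_count members.
    for i in sample(range(y_count * x_count), sample_size):
        # divmod(i, x_count) gives (row, col) for a table of shape
        # (y_count, x_count) whose flat value at that cell is i itself.
        yield divmod(i, x_count)

picked = list(random_cells(10))
print(picked)
```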