Search code examples
pythonrandomlazy-evaluationsampling

Lazily sample random results in python


Python question. I'm generating a large array of objects, which I only need to make a small random sample. Actually generating the objects in question takes a while, so I wonder if it would be possible to somehow skip over those objects that don't need generating and only explicitly create those objects that have been sampled.

In other words, I now have

a = createHugeArray()
s = random.sample(a,len(a)*0.001)

which is rather wasteful. I would prefer something more lazy like

a = createArrayGenerator()
s = random.sample(a,len(a)*0.001)

I don't know if this works. The documentation on random.sample isn't too clear, though it mentions xrange as being very fast - which makes me believe it might work. Converting the array creation to a generator would be a bit of work (my knowledge of generators is very rusty), so I want to know if this works in advance. :)

An alternative I can see is to make a random sample via xrange, and only generate those objects that are actually selected by index. That's not very clean though, because the indices generated are arbitrary and unnecessary, and I would need rather hacky logic to support this in my generateHugeArray method.

For bonus points: how does random.sample actually work? Especially, how does it work if it doesn't know the size of the population in advance, as with generators like xrange?


Solution

  • There does not seem a way that avoids figuring out how the indices map to your permutations. If you don't know this, how would you create a random object from your array? You could either use the trick using xrange() you suggested yourself, or implement a class defining the __getitem__() and __len__() methods and pass and object of this class as population argument to random.sample().

    Some further comments:

    • Converting createHugeArray() into a generator won't buy you anything -- random.sample() will just not work anymore. It needs an object supporting len().

    • So it does need to know the number of elements in the population right from the beginning.

    • The implementation features two different algorithms and chooses the one that will use less memory. For relatively small k (that is, in the case at hand) it will simply save the indices already chosen in a set and make a new random choice if it hits one of them.

    Edit: A completely different approach would be to iterate over all permutations once and decide for each permutation if it should be included. If the total number of permutations is n and you would like to select k of them, you could write

    selected = []
    for i in xrange(n):
        perm = nextPermutation()
        if random.random() < float(k-len(selected))/(n-i):
            selected.append(perm)
    

    This would choose exactly k permutations randomly.