
1B pseudo-random boolean permutations of vector set N by M


Let's say we have a set of N pseudo-random boolean/binary vectors, each of length M.

e.g. {[0,1,0,1], [0,0,0,1]}

We need one billion unique vectors.

Now, the question is: what is the fastest way to do this? (Also keep in mind that we need to keep the vector set in memory, so that we can later use the same set for other work.)

For example, here are some approaches I have tried:

## where N = 10 and M = 10,000
# vector_set(N, M) builds the base set of N random binary vectors of length M
# (helper defined elsewhere); binding it as a default argument evaluates it once
def idx_v(idx, ivals=vector_set(N, M)):
  divs = idx // 10
  remains = idx % 10
  hv = str(abs(hash(str([divs, remains]))))  # hash digits used as roll offsets
  ival = ivals[remains]
  for i in hv:
    ival = np.roll(ival, int(i))
  return ival, divs, remains

for idx in range(1_000_000_000):
  ival = idx_v(idx)

that takes about 0.006 seconds per 10 vectors generated

## where N = 100 and M = 10,000
def idx_v(idx, ivals=vector_set(N, M)):
  divs = idx // 100
  remains = idx % 100
  ival = np.roll(ivals[remains], divs)
  return ival, divs, remains

for idx in range(1_000_000_000):
  ival = idx_v(idx)

that takes about 0.005 seconds per 10 vectors generated


Solution

  • All the complex mixing and hashing of the numbers doesn't really achieve much. Here's a solution that gives you an indexable collection of pseudo-random vectors, which you can either let vary between runs or reproduce exactly on every run:

    import numpy as np
    
    
    def seeded_vector(n, i, seed=np.random.randint(0, np.iinfo(np.int32).max)):
        # the default seed is drawn once, when the function is defined, so it
        # is fixed within a run but varies between runs
        np.random.seed(seed + i)
        return np.random.randint(2, size=n)
    
    
    print('10 random vectors of length 10:')
    for i in range(10):
        print(seeded_vector(10, i))
    
    print('and the same 10 vectors again:')
    for i in range(10):
        print(seeded_vector(10, i))
    
    print('a collection of 10 vectors that is fixed if the seed is set:')
    for i in range(10):
        print(seeded_vector(10, i, seed=42))
    
    print('and again:')
    for i in range(10):
        print(seeded_vector(10, i, seed=42))
    

    Output on my system:

    10 random vectors of length 10:
    [1 1 0 1 1 0 0 1 1 1]
    [1 1 0 1 1 1 0 0 1 0]
    [0 0 1 1 1 0 1 1 1 0]
    [0 0 1 1 0 0 0 0 0 0]
    [0 0 1 0 1 0 1 0 0 1]
    [0 1 1 0 1 1 0 1 0 0]
    [0 0 0 0 0 0 1 1 1 0]
    [1 0 0 1 1 1 1 0 1 0]
    [0 0 0 1 1 0 0 1 1 0]
    [0 1 1 1 1 0 1 0 1 0]
    and the same 10 vectors again:
    [1 1 0 1 1 0 0 1 1 1]
    [1 1 0 1 1 1 0 0 1 0]
    [0 0 1 1 1 0 1 1 1 0]
    [0 0 1 1 0 0 0 0 0 0]
    [0 0 1 0 1 0 1 0 0 1]
    [0 1 1 0 1 1 0 1 0 0]
    [0 0 0 0 0 0 1 1 1 0]
    [1 0 0 1 1 1 1 0 1 0]
    [0 0 0 1 1 0 0 1 1 0]
    [0 1 1 1 1 0 1 0 1 0]
    a collection of 10 vectors that is fixed if the seed is set:
    [0 1 0 0 0 1 0 0 0 1]
    [0 0 1 1 1 0 0 1 1 1]
    [0 1 1 1 1 1 0 0 1 1]
    [1 0 1 0 0 1 1 1 1 0]
    [1 1 0 0 0 1 0 1 1 0]
    [1 0 1 0 0 1 0 1 1 0]
    [0 1 1 0 1 0 0 0 0 0]
    [0 1 0 1 0 0 0 0 1 1]
    [0 0 1 1 1 0 0 0 1 0]
    [0 1 1 0 1 1 0 1 0 0]
    and again:
    [0 1 0 0 0 1 0 0 0 1]
    [0 0 1 1 1 0 0 1 1 1]
    [0 1 1 1 1 1 0 0 1 1]
    [1 0 1 0 0 1 1 1 1 0]
    [1 1 0 0 0 1 0 1 1 0]
    [1 0 1 0 0 1 0 1 1 0]
    [0 1 1 0 1 0 0 0 0 0]
    [0 1 0 1 0 0 0 0 1 1]
    [0 0 1 1 1 0 0 0 1 0]
    [0 1 1 0 1 1 0 1 0 0]
    

    Note that the bottom two blocks will always be exactly the same for every run, while the top two blocks will differ between runs but are always identical within a single run.

    Note that, depending on the exact algorithm used by Python/NumPy, it's possible you'd get the same result every time on one system, but not across all systems. I'd expect the result to be the same for any run on any machine, at least with similar versions of Python and NumPy, but I haven't verified that this is guaranteed.
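    If cross-system reproducibility really matters, one option worth considering (a sketch, not part of the solution above) is a counter-based bit generator such as NumPy's Philox, where the vector index is fed in directly as the counter, so no global state is involved at all. Whether a given stream is bit-identical across NumPy versions is governed by NumPy's random-stream compatibility policy, so check the documentation for the versions you pin.

```python
import numpy as np

def philox_vector(n, i, seed=42):
    """Length-n 0/1 vector for index i, derived without global RNG state.

    Philox is a counter-based bit generator, so each index i selects its
    own independent, reproducible stream.
    """
    bitgen = np.random.Philox(key=seed, counter=i)
    return np.random.Generator(bitgen).integers(2, size=n)
```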

    You were asking for the fastest solution. I doubt you can get much faster than this in terms of up-front time, since it literally only takes the time to define the function that returns the vectors. However, if you access these vectors very frequently, you could consider adding some form of caching to the function, so that repeat accesses look a vector up instead of regenerating it.
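    A minimal sketch of such caching, using `functools.lru_cache` around the same generation scheme as `seeded_vector` above (the wrapper name is mine; the cache stores immutable tuples so a caller mutating a returned array can't corrupt cached entries):

```python
import functools
import numpy as np

@functools.lru_cache(maxsize=100_000)
def _cached_bits(n, i, seed):
    # same generation scheme as seeded_vector above
    np.random.seed(seed + i)
    # store as an immutable tuple so cached entries can't be mutated
    return tuple(np.random.randint(2, size=n))

def seeded_vector_cached(n, i, seed=42):
    # repeat lookups hit the cache instead of regenerating the vector
    return np.array(_cached_bits(n, i, seed))
```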

    However, I doubt the added overhead will be worth it, unless you access these very frequently indeed - and as others have pointed out, you may find you don't have enough memory to do that for the size of the collection you seem to need. Trading a tiny bit of speed for not needing any storage at all seems like a pretty good deal here.

    If you always access the collection in order, or in fixed blocks, an optimisation could be to return these blocks in one go (or yield them from a generator), without setting the seed before every value. But that would preclude directly indexing a vector individually.
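    A sketch of that block-wise variant (the generator name and block layout are my own choices, assuming the same seeding scheme as above): the seed is set once per block, so the vectors inside a block come out of a single call, but an individual vector can only be reached by regenerating its whole block.

```python
import numpy as np

def vector_blocks(n, block_size, seed=42, start_block=0):
    """Yield consecutive blocks of `block_size` vectors of length `n`.

    One reseed per block instead of one per vector; trade-off: no direct
    indexing of individual vectors.
    """
    block = start_block
    while True:
        np.random.seed(seed + block)
        # one (block_size, n) array of 0/1 values per reseed
        yield np.random.randint(2, size=(block_size, n))
        block += 1

# consume the first 3 blocks of 4 vectors each
gen = vector_blocks(10, 4)
for _, block in zip(range(3), gen):
    print(block.shape)  # (4, 10)
```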