Tags: python, numpy, optimization, numpy-random

How can I quickly generate a (large) list of random numbers given a list of seeds as input?


I need to make a function that takes in an array of integers and returns a list of random numbers with the same length as the array. The restriction is that the output random number corresponding to a given entry must depend only on that entry's value, so the same input value always yields the same output.

For example, if calling the function on input_a returns the following:

> input_a = np.array([1, 2, 3, 4, 5])
> random_array(input_a)
[0.51689016 0.62747792 0.16585436 0.63928942 0.30514275]

Then calling it on input_b should return the following:

> input_b = np.array([3, 2, 3])
> random_array(input_b)
[0.16585436 0.62747792 0.16585436]

Note that the output numbers that correspond to an input of 3 are all the same, and likewise for those that correspond to an input of 2. In effect, the values of the input array are used as seeds for the output array.

The main issue is that the input arrays may be very big, so I'd need something that can do the operation efficiently.

My naive implementation is as follows: it builds a list of random number generators, one per entry, using the input values as seeds.

import numpy as np

def random_array(input_array):
    # One generator per entry, seeded by that entry's value - this Python
    # loop over the whole array is the bottleneck.
    rng_list = [np.random.default_rng(seed=i) for i in input_array]
    return [rng.random() for rng in rng_list]

input_a = np.array([1, 2, 3])
input_b = np.array([3, 2, 3])

print(random_array(input_a)) # [0.5118216247002567, 0.2616121342493164, 0.08564916714362436]
print(random_array(input_b)) # [0.08564916714362436, 0.2616121342493164, 0.08564916714362436]

It works as intended, but it's terribly slow for what I need - unsurprising, given that it loops over the array entries in Python. This implementation takes about 5 seconds on an input array of length 100,000, and I'll need to handle much larger inputs than that.

How can I do this but more efficiently?

Edit to add information: A typical input array has a length of around 200 million. Its value range may greatly exceed its length - np.max(input_a) may be in the trillions - but all its values can be assumed to be nonnegative. The number of repeated values is small compared to the length of the array.

What I'm specifically trying to do is take a set of particles in the output of a simulation (totalling about 200 billion particles) and make a smaller set of particles randomly sampled ("downsampled") from the large set; the downsampled output should contain about 1% of the total particles. Each particle is labelled with an ID, which is nonnegative but can be very large. The simulation output is split into a dozen or so "snapshots", each of which stores every particle's position (among other things); each snapshot is split into "subsnapshots", individual files that store the IDs/positions/etc. of about 200 million particles each. Particle IDs within a snapshot are unique, but the same particle (with the same ID) will naturally appear in multiple snapshots.

What I'm trying to do is make a mask that decides whether to keep a particle based upon its ID. The idea is that if a particle is kept in one snapshot, it should be kept in all snapshots. RAM is not infinite; only one subsnapshot's worth of particle info can be loaded at once. The particles in the Nth subsnapshot of a given snapshot are the same as the particles in the Nth subsnapshot of a different snapshot, but not necessarily in the same order. Lookup files can in principle be saved and read, but again, only one subsnapshot's worth at a time (and this is slow, so if there's a better way, that'd be ideal).

This is the motivation for a function that assigns an RNG value to many particles at once, keyed on an input array of particle IDs: the RNG value determines whether each particle is kept, and making it a deterministic function of the ID ensures that a kept particle is always kept.
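
For concreteness, here is a minimal sketch of the intended usage, assuming a hypothetical helper id_to_uniform that deterministically maps each ID to a float in [0, 1) - this helper is exactly what the question asks for:

import numpy as np

KEEP_FRACTION = 0.01  # downsample to ~1% of the particles

def keep_mask(particle_ids, id_to_uniform):
    # id_to_uniform is the sought-after vectorized function: the same ID
    # must always map to the same float, so a kept particle is kept in
    # every snapshot regardless of subsnapshot ordering.
    return id_to_uniform(particle_ids) < KEEP_FRACTION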


Solution

  • It may be good enough to implement a simple pseudorandom number generator in which each int in the array acts as a seed. SplitMix64 is pretty simple, and consecutive integer seeds generate very different results; other pseudorandom options are also worth exploring. A usage sketch for the downsampling mask follows the code below.

    import numpy as np

    TWO53 = 1 << 53

    def sm64(arr):
        """SplitMix64 pseudorandom number generator (mixing step)."""
        arr = arr.astype(np.uint64)  # change to unsigned, keeps the bit pattern
        arr += 0x9e3779b97f4a7c15    # NumPy integer arithmetic wraps modulo 2**64
        arr = (arr ^ (arr >> 30)) * 0xbf58476d1ce4e5b9
        arr = (arr ^ (arr >> 27)) * 0x94d049bb133111eb
        arr ^= arr >> 31
        return (arr >> 11) / TWO53   # float64 has 53 bits of precision

    test = np.arange(-10, 10)

    test
    # array([-10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,
    #          3,   4,   5,   6,   7,   8,   9])

    sm64(test)
    # array([0.83014881, 0.8990996 , 0.53510476, 0.42233422, 0.91005283,
    #        0.08865044, 0.7554433 , 0.96629362, 0.94971076, 0.89394292,
    #        0.88331081, 0.56656158, 0.59118973, 0.11345034, 0.43145582,
    #        0.38676805, 0.73981701, 0.38982975, 0.61850463, 0.68236273])
    

    This ran in about 240 ms for a 10-million-element int array.
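
    As a usage sketch for the question's downsampling problem (assuming sm64 as defined above, with the ~1% threshold taken from the question):

    ids = np.array([3, 2, 3], dtype=np.int64)  # example particle IDs
    keep = sm64(ids) < 0.01  # keep a particle iff its hash falls below 1%

    # Repeated IDs always hash to the same value, regardless of ordering,
    # so a particle kept in one subsnapshot is kept in every snapshot:
    assert sm64(np.array([3, 2, 3]))[0] == sm64(np.array([3]))[0]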