
What is the *right* way to seed random number generation in a python multiprocessing pool?


I am using Pool.map() from the multiprocessing package on an embarrassingly parallel project. I want to seed the numpy random number generator once for each worker (not once per function call). My understanding from some past answers is that one should use the initializer parameter to seed each worker. However, when I pass numpy.random.seed as the initializer, I get very poor seeding: the workers generate mostly the same random numbers, but not all.

There have been some changes to the way random numbers work in numpy so perhaps some of those answers are out of date. Take a look at this minimal example that illustrates the issue:

import multiprocessing                                                              
import numpy as np                                                                                                                                               
                                                                                    
def my_fun(_):                                  
    return rng.uniform()                                                                                             
     
if __name__ == "__main__":
    rng = np.random.default_rng()
    with multiprocessing.Pool(processes=4, initializer=np.random.seed) as pool:
        my_list = pool.map(my_fun, range(40))  
    print(f"Number of unique values: {len(set(my_list))}") 

I would expect my_list to contain exactly 40 distinct values if seeding works, or exactly 10 (each of the 4 workers repeating the same stream) if it does not. But it tends to be more like 12-15. Is there a different best practice for seeding these workers? Remember, I do not want to add any code to my_fun(), because it will be called a large number of times by each worker. I just want each worker to start from a different place, so that the workers are independent.

I do not require reproducibility for this project, but it would be nice if the solution did provide it. Python 3.10.5 on Linux.


Solution

  • You were close. Try this instead:

    import multiprocessing                                                              
    import numpy as np
    
    def init():
        global rng
        rng = np.random.default_rng()
                                                                                        
    def my_fun(_):                                  
        return rng.uniform()                                                                                             
         
    if __name__ == "__main__":
        with multiprocessing.Pool(processes=4, initializer=init) as pool:
            my_list = pool.map(my_fun, range(40))  
        print(f"Number of unique values: {len(set(my_list))}") 
    

    The recommendation is to create a new instance of the generator rather than reseeding the legacy global state. Here we create one new, freshly seeded generator for each worker in the pool.
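If you also want each worker's stream to be determined by a single root seed, one option is to spawn child seeds from a SeedSequence and hand one to each worker through a queue. This is a sketch, not part of the answer above; the init/run names and the queue handoff are my own scaffolding. Note that which worker handles which task still varies between runs, so per-task outputs are not fully reproducible even though each worker's stream is.

```python
import multiprocessing
import numpy as np

def init(seed_queue):
    # Each worker pops its own child SeedSequence, so every worker's
    # stream is independent and derived from the single root seed.
    global rng
    rng = np.random.default_rng(seed_queue.get())

def my_fun(_):
    return rng.uniform()

def run(root_seed, n_workers=4, n_tasks=40):
    # spawn() derives one statistically independent child per worker.
    seed_queue = multiprocessing.Queue()
    for child in np.random.SeedSequence(root_seed).spawn(n_workers):
        seed_queue.put(child)
    with multiprocessing.Pool(processes=n_workers,
                              initializer=init,
                              initargs=(seed_queue,)) as pool:
        return pool.map(my_fun, range(n_tasks))
```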

    For reproducible results, add code to init() to pickle each new generator or print its state:

    print(rng.bit_generator.state)
    

    The output is sufficient to reconstruct the generator's state. It looks like this:

    {'bit_generator': 'PCG64',
     'state': 
         {'state': 319129345033546980483845008489532042435,
          'inc': 198751095538548372906400916761105570237},
     'has_uint32': 0,
     'uinteger': 0}
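
To rebuild a generator from a recorded state later, assign the dict back through the bit_generator.state accessor on a fresh generator. A short sketch, reusing the example state printed above:

```python
import numpy as np

# A state dict recorded from a worker (values copied from the output above).
recorded = {
    'bit_generator': 'PCG64',
    'state': {'state': 319129345033546980483845008489532042435,
              'inc': 198751095538548372906400916761105570237},
    'has_uint32': 0,
    'uinteger': 0,
}

rng = np.random.default_rng()       # any fresh PCG64-backed generator
rng.bit_generator.state = recorded  # overwrite its state with the record
```

Two generators restored from the same record produce identical streams.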