numpy machine-learning random random-seed numpy-random

Correctly seeding numpy random generator


For my scientific experiments, I usually seed using:

rng = np.random.Generator(np.random.PCG64(seed))

which, for the current NumPy version, is equivalent to

rng = np.random.default_rng(seed)
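The equivalence between wrapping PCG64 explicitly and calling default_rng can be checked directly. This is only a minimal sketch: it assumes a NumPy version in which default_rng uses PCG64 as its bit generator, and the seed value 12345 is an arbitrary illustrative choice.

```python
import numpy as np

seed = 12345  # arbitrary illustrative value, not a recommendation

# Explicit construction: wrap a PCG64 bit generator in a Generator.
rng_explicit = np.random.Generator(np.random.PCG64(seed))

# Convenience constructor: default_rng already returns a Generator.
rng_default = np.random.default_rng(seed)

draws_explicit = rng_explicit.random(5)
draws_default = rng_default.random(5)

# Under the assumption above, both streams should be identical.
print(np.array_equal(draws_explicit, draws_default))
```

Note that default_rng(seed) already returns a Generator, so it does not need to be wrapped in np.random.Generator again.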

Since I repeat my experiments n times and average their results, I usually use each of the integers between 0 and n as a seed in turn.

However, the NumPy documentation (here and here) states that

Seeds should be large positive integers.

or

We default to using a 128-bit integer using entropy gathered from the OS. This is a good amount of entropy to initialize all of the generators that we have in numpy. We do not recommend using small seeds below 32 bits for general use.

However, in the second reference, it also states

There will not be anything wrong with the results, per se; even a seed of 0 is perfectly fine thanks to the processing that SeedSequence does.
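To see what "the processing that SeedSequence does" looks like in practice, one can inspect the state words it derives from small seeds. This is an illustrative sketch, not a statistical test: the seeds 0 and 1 and the choice of four state words are arbitrary.

```python
import numpy as np

# SeedSequence hashes the user-supplied entropy before it reaches the
# bit generator, so even adjacent small seeds yield unrelated-looking state.
state_zero = np.random.SeedSequence(0).generate_state(4)
state_one = np.random.SeedSequence(1).generate_state(4)

print(state_zero)  # four 32-bit words derived from seed 0
print(state_one)   # four 32-bit words derived from seed 1
```

The two printed arrays should differ even though the seeds differ by only one, which is why a seed of 0 does not produce a degenerate stream.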

This feels contradictory, and I wonder whether small seeds are now totally fine to use or whether one should move towards larger seeds. In particular, I wonder (i) at which point, if any, a large seed would make a difference compared to a small seed, and (ii) whether, for scientific experiments (e.g. machine learning or algorithmic research), one should prefer larger seeds over smaller ones or whether it makes no difference.

PS: This question is closely related to Random number seed in numpy, but concerns the now-recommended Generator. Furthermore, the answer there does not seem in-depth enough, as it does not discuss high versus low seeds.


Solution

  • The justification is in the quick start page which you linked:

    We recommend using very large, unique numbers to ensure that your seed is different from anyone else’s. This is good practice to ensure that your results are statistically independent from theirs unless you are intentionally trying to reproduce their result.

    In short, this is to avoid reproducing someone else's bias (if any) by generating the exact same dataset, since humans are more likely to pick small numbers by default (0, 11, 42) than very large ones.

    In your use case this is probably not important.
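If one does want to follow the "very large, unique numbers" recommendation for repeated experiments, a possible sketch is to draw one large root seed and derive one independent generator per repetition from it. The names n_runs and root_entropy and the toy "experiment" (the mean of normal draws) are hypothetical and used only for illustration.

```python
import secrets

import numpy as np

# Draw a large root seed once; in a real experiment, record this value
# so the whole set of runs can be reproduced later.
root_entropy = secrets.randbits(128)

n_runs = 5  # hypothetical number of repetitions

# Spawn one child SeedSequence per repetition from the root.
child_seqs = np.random.SeedSequence(root_entropy).spawn(n_runs)
rngs = [np.random.default_rng(child) for child in child_seqs]

# Toy stand-in for an experiment: the mean of 1000 normal draws per run.
results = [rng.normal(size=1000).mean() for rng in rngs]

print(f"root entropy: {root_entropy}")
print(f"average over {n_runs} runs: {np.mean(results):.4f}")
```

Re-running with the recorded root_entropy instead of a fresh secrets.randbits(128) call should reproduce the same per-run streams. Seeding the runs with 0, 1, ..., n also works; the difference is only that a large random root is unlikely to coincide with a seed someone else picked.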