python random scikit-learn platform seed

Platform-independent random state in scikit-learn train_test_split


Does setting a specific random seed (random_state) when splitting train/test datasets with scikit-learn produce the same initialization of the random number generator (i.e., the same pseudo-random numbers) across different platforms, for instance across different cloud computing instances?

Thanks!


Solution

  • As long as random_state is equal on all platforms and they are all running the same versions of numpy, you should get the exact same splits.

    Since an integer random_state is used to seed a numpy.random.RandomState instance, I think all of scikit-learn's pseudo-random number generators are effectively frozen, because numpy froze the stream that RandomState produces for a given seed.

    See the scikit-learn documentation for random_state, which shows that it is backed by numpy.random.RandomState, and numpy's documentation for its compatibility guarantee.
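
One way to check this yourself is to compute a small fingerprint of the split on each machine and compare the values. The sketch below is only an illustration: the helper name split_fingerprint, the sample count, and the test size are assumptions made for the example, not anything from scikit-learn itself.

```python
import hashlib

import numpy as np
from sklearn.model_selection import train_test_split


def split_fingerprint(seed, n_samples=100, test_size=0.25):
    """Return a short hash of the test indices selected for a given seed.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    indices = np.arange(n_samples)
    _, test_idx = train_test_split(
        indices, test_size=test_size, random_state=seed
    )
    # Cast to a fixed-width, little-endian dtype so the hashed bytes do not
    # depend on the platform's default integer size or byte order.
    fixed = test_idx.astype(np.dtype("<i8"))
    return hashlib.sha256(fixed.tobytes()).hexdigest()[:16]


# Two calls with the same seed on the same machine should agree.
fingerprint_a = split_fingerprint(seed=42)
fingerprint_b = split_fingerprint(seed=42)
print(fingerprint_a, fingerprint_a == fingerprint_b)
```

Run the same script on each platform and compare the printed fingerprint. If the values match, the splits are identical on those machines; if they differ, comparing the installed numpy and scikit-learn versions is a sensible first step.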