python numpy random

Understanding the role of `shuffle` in np.random.Generator.choice()


According to the documentation for NumPy's random.Generator.choice, one of the arguments is shuffle, which defaults to True.

The documentation states:

shuffle : bool, optional

Whether the sample is shuffled when sampling without replacement. Default is True, False provides a speedup.

There isn't enough information for me to figure out what this means. I don't understand why we would shuffle if it's already appropriately random, and I don't understand why I would be given the option to not shuffle if that yields a biased sample.

If I set shuffle to False am I still getting a random (independent) sample? I'd love to also understand why I would ever want the default setting of True.


Solution

  • You still get a random sample regardless of the value of shuffle. With shuffle=False, however, the order of the output is not independent of the order of the input.

    This is easiest to see when the number of items chosen equals the total number of items:

    import numpy as np
    rng = np.random.default_rng()
    x = np.arange(10)
    rng.choice(x, 10, replace=False, shuffle=False)
    # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    rng.choice(x, 10, replace=False, shuffle=True)
    # array([8, 1, 3, 9, 6, 5, 0, 7, 4, 2])
    

    If you choose fewer items (here 9 of 10) and use shuffle=False, you can confirm that the one missing item is still uniformly distributed, as expected.

    import numpy as np
    import matplotlib.pyplot as plt
    rng = np.random.default_rng()
    x = np.arange(10)
    set_x = set(x)
    missing = []
    for i in range(10000):
        # By default, all `p` are equal, so which item is
        # missing should be uniformly distributed
        y = rng.choice(x, 9, replace=False, shuffle=False)
        set_y = set(y)
        missing.append(set_x.difference(set_y).pop())
    plt.hist(missing)
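
The same uniformity check can be done numerically rather than by eye. The sketch below is a plot-free variant of the loop above; the seed and the use of `np.bincount` are choices made here for illustration, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed, only for reproducibility
n_items, n_trials = 10, 10_000
x = np.arange(n_items)

missing = np.empty(n_trials, dtype=np.int64)
for i in range(n_trials):
    y = rng.choice(x, n_items - 1, replace=False, shuffle=False)
    # Exactly one element of x is absent from y, so the two sums
    # differ by the value of that missing element.
    missing[i] = x.sum() - y.sum()

# How often each item was the one left out
counts = np.bincount(missing, minlength=n_items)
print(counts)             # each entry should be near n_trials / n_items
print(counts / n_trials)  # empirical frequencies, each near 0.1
```

If shuffle=False biased which items are selected, some entries of `counts` would sit far from 1000.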
    


    But you'll see that items that appeared earlier in x tend to appear earlier in the output and vice-versa. That is, the input and output orders are correlated.

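    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats
    rng = np.random.default_rng()
    x = np.arange(10)
    correlations = []
    for i in range(10000):
        y = rng.choice(x, 9, replace=False, shuffle=False)
        # Rank correlation between output position and the value drawn
        correlations.append(stats.spearmanr(np.arange(9), y).statistic)
    plt.hist(correlations)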
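
To put a single number on the order correlation, the sketch below compares the mean rank correlation for shuffle=False, shuffle=True, and shuffle=False followed by an explicit `rng.permutation`. The helper names `spearman_with_position` and `mean_corr` are made up for this example, and the Spearman coefficient is computed with NumPy alone (valid here because the sampled values are distinct). The third case is an assumption being tested, namely that shuffling afterwards removes the order correlation, not a statement about how NumPy implements shuffle=True.

```python
import numpy as np

def spearman_with_position(y):
    """Spearman correlation between output position and value (distinct values only)."""
    ranks = np.argsort(np.argsort(y))  # 0-based rank of each value
    positions = np.arange(len(y))
    return np.corrcoef(positions, ranks)[0, 1]

rng = np.random.default_rng(2024)  # arbitrary seed, only for reproducibility
x = np.arange(10)
n_trials = 2000

def mean_corr(sample):
    """Average the position/value correlation over n_trials fresh samples."""
    return np.mean([spearman_with_position(sample()) for _ in range(n_trials)])

unshuffled = mean_corr(lambda: rng.choice(x, 9, replace=False, shuffle=False))
shuffled = mean_corr(lambda: rng.choice(x, 9, replace=False, shuffle=True))
reshuffled = mean_corr(
    lambda: rng.permutation(rng.choice(x, 9, replace=False, shuffle=False))
)

print(f"shuffle=False:               {unshuffled:+.3f}")
print(f"shuffle=True:                {shuffled:+.3f}")
print(f"shuffle=False + permutation: {reshuffled:+.3f}")
```

The first value should be clearly positive, while the other two should hover around zero.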
    


    If that is ok for your application, feel free to set shuffle=False for a speedup.

    %timeit rng.choice(10000, 5000, replace=False, shuffle=True)
    # 187 µs ± 26.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit rng.choice(10000, 5000, replace=False, shuffle=False)
    # 146 µs ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    The more items that are to be chosen, the more pronounced the speedup.

    %timeit rng.choice(10000, 1, replace=False, shuffle=True)
    # 17.6 µs ± 3.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit rng.choice(10000, 1, replace=False, shuffle=False)
    # 16.5 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    vs

    %timeit rng.choice(10000, 9999, replace=False, shuffle=True)
    # 214 µs ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit rng.choice(10000, 9999, replace=False, shuffle=False)
    # 124 µs ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
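
As an illustration of an application where the correlated order is harmless, consider a hold-out split in which only membership in the test set matters. This is a minimal sketch: the data set, its shape, and the variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n_rows, n_test = 1_000, 200
data = rng.normal(size=(n_rows, 3))  # stand-in data set

# Only *which* rows are selected matters here, not the order in which
# their indices come back, so shuffle=False is safe.
test_idx = rng.choice(n_rows, n_test, replace=False, shuffle=False)

# Convert the indices to a boolean mask; any ordering information is discarded.
is_test = np.zeros(n_rows, dtype=bool)
is_test[test_idx] = True

test, train = data[is_test], data[~is_test]
print(test.shape, train.shape)  # (200, 3) (800, 3)
```

If the selected rows were instead consumed in the order returned (for example, as a processing queue), the default shuffle=True would be the safer choice.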