Search code examples
pythonnumpyrandomsample

How to generate random numbers fast from "An Extremely Big Array" in Python?


Recently, I got stuck searching for a fast way to generate random numbers in Python. Here like the code below, I collect points from an area with more than 10 million points. And I want to sample 8192 points randomly from the area points. However, np.random.choice(np.arange(...)) makes the program so slow since the large index arrays.

So, how to make this procedure faster? With any useful Python package or function? Thanks for your attention!

import numpy as np

area_idx = self.area_idxs[idx]
points = self.area_points[area_idx]  # 14350962 * 3
labels = self.area_labels[area_idx]  # 14350962
npoints = 8192
# np.random.choice(np.arange(...)) makes the program so slow.
selected_point_idxs = np.random.choice(np.arange(len(points)), npoints, replace=False)

Generate random numbers fast from "An Extremely Big Array" in Python.


Solution

  • Numpy mentioned in the document that new code should use the new random generator to generate random numbers (refer to note in numpy.random.choice), and the new random generator will have better performance in choice:

    In [_]: n, k = 14350962, 8192
    
    In [_]: %timeit np.random.choice(np.arange(n), k, replace=False)
    772 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [_]: %timeit np.random.choice(n, k, replace=False)
    742 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [_]: rng = np.random.default_rng()
    
    In [_]: %timeit rng.choice(np.arange(n), k, replace=False)
    19.1 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [_]: %timeit rng.choice(n, k, replace=False)
    147 µs ± 687 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    And when indices are directly generated by np.arange, the same result can be obtained by passing only the length, avoiding the construction of indices.

    Some supplements

    In the new Generator.choice, for cases where k is small, the numpy team uses Floyd's sampling algorithm with a time complexity of O(k), while the old RandomState.choice always uses a random permutation with a time complexity of O(n).