Search code examples
pythonarraysnumpynumpy-ndarraystatistics-bootstrap

How can I bootstrap the innermost array of a numpy array?


I have a numpy array of these dimensions

data.shape (categories, models, types, events): (10, 11, 50, 100)

Now I want to do sample with replacement in the innermost array (100) only. For a single array such as this:

data[0][0][0]

array([ 40.448624 , 39.459843 , 33.76762 , 38.944622 , 21.407362 , 35.55499 , 68.5111 , 16.512974 , 21.118315 , 18.447166 , 16.026619 , 21.596252 , 41.798622 , 63.01645 , 46.886642 , 68.874756 , 17.472408 , 53.015724 , 85.41213 , 59.388977 , 17.352108 , 61.161705 , 23.430847 , 20.203123 , 22.73194 , 77.40547 , 43.02974 , 29.745787 , 21.50163 , 13.820962 , 46.91466 , 41.43656 , 18.008326 , 13.122162 , 59.79936 , 94.555305 , 24.798452 , 30.362497 , 13.629236 , 10.792178 , 35.298515 , 20.904285 , 15.409604 , 20.567234 , 46.376335 , 13.82727 , 17.970661 , 18.408686 , 21.987917 , 21.30094 , 24.26776 , 27.399046 , 49.16879 , 21.831453 , 66.577 , 15.524615 , 18.091696 , 24.346598 , 24.709772 , 19.068447 , 24.221592 , 25.244864 , 52.865868 , 22.860783 , 23.586731 , 18.928782 , 21.960285 , 74.77856 , 15.176119 , 20.795431 , 14.3638935, 35.937237 , 29.993324 , 30.848495 , 48.145336 , 38.02541 , 101.15249 , 49.801117 , 38.123184 , 12.041505 , 18.788296 , 20.53382 , 31.20367 , 19.76104 , 92.56279 , 41.62944 , 23.53344 , 18.967432 , 14.781404 , 20.02018 , 27.736559 , 16.108913 , 44.935062 , 12.629299 , 34.65672 , 20.60169 , 21.779675 , 31.585844 , 23.768578 , 92.463196 ], dtype=float32)

I can do sample with replacement using this: np.random.choice(data[0][0][0], 100), which I will be doing thousands of times.

array([ 13.629236, 92.56279 , 21.960285, 20.567234, 21.50163 , 16.026619, 20.203123, 23.430847, 16.512974, 15.524615, 18.967432, 22.860783, 85.41213 , 21.779675, 23.586731, 24.26776 , 66.577 , 20.904285, 19.068447, 21.960285, 68.874756, 31.585844, 23.586731, 61.161705, 101.15249 , 59.79936 , 16.512974, 43.02974 , 16.108913, 24.26776 , 23.430847, 14.781404, 40.448624, 13.629236, 24.26776 , 19.068447, 16.026619, 16.512974, 16.108913, 77.40547 , 12.629299, 31.585844, 24.798452, 18.967432, 14.781404, 23.430847, 49.16879 , 18.408686, 22.73194 , 10.792178, 16.108913, 18.967432, 12.041505, 85.41213 , 41.62944 , 31.20367 , 17.970661, 29.745787, 39.459843, 10.792178, 43.02974 , 21.831453, 21.50163 , 24.798452, 30.362497, 21.50163 , 18.788296, 20.904285, 17.352108, 41.798622, 18.447166, 16.108913, 19.068447, 61.161705, 52.865868, 20.795431, 85.41213 , 49.801117, 13.82727 , 18.928782, 41.43656 , 46.886642, 92.56279 , 41.62944 , 18.091696, 20.60169 , 48.145336, 20.53382 , 40.448624, 20.60169 , 23.586731, 22.73194 , 92.56279 , 94.555305, 22.73194 , 17.352108, 46.886642, 27.399046, 18.008326, 15.176119], dtype=float32)

But since there is no axis in np.random.choice, how can I do it for all arrays (i.e. (categories, models, types))? Or is looping through it the only option?


Solution

  • The fastest/simplest answer turns out to be based on indexing a flattened version of your array:

    def resampFlat(arr, reps):
        n = arr.shape[-1]
    
        # create an array to shift random indexes as needed
        shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)
    
        # get a flat view of the array
        arrflat = arr.ravel()
        # sample the array by generating random ints and shifting them appropriately
        return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift] 
                         for i in range(reps)])
    

    Timings confirm that this is the fastest answer.

    Timings

    I tested out the above resampFlat function alongside a simpler for loop based solution:

    def resampFor(arr, reps):
        # store the shape for the return value
        shape = arr.shape
        # flatten all dimensions of arr except the last
        arr = arr.reshape(-1, arr.shape[-1])
        # preallocate the return value
        ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
        # generate the indices of the resampled values
        idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))
    
        for rep,idx in zip(ret, idxs):
            # iterate over the resampled replicates
            for row,rowrep,i in zip(arr, rep, idx):
                # iterate over the event arrays within a replicate
                rowrep[...] = row[i]
    
        # give the return value the appropriate shape
        return ret.reshape((reps, *shape))
    

    and a solution based on Paul Panzer's fancy indexing approach:

    def resampFancyIdx(arr, reps):
        idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
        _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]
    
        return arr[I, J, K, idx]
    

    I tested with the following data:

    shape = ((10, 11, 50, 100))
    data = np.arange(np.prod(shape)).reshape(shape)
    

    Here's the results from the array flattening approach:

    %%timeit
    resampFlat(data, 100)
    
    1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    the results from the for loop approach:

    %%timeit
    resampFor(data, 100)
    
    1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    and from Paul's fancy indexing:

    %%timeit
    resampFancyIdx(data, 100)
    
    1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Contrary to my expectations, resampFancyIdx beat resampFor, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.