python numpy one-hot-encoding noise

numpy shuffle a fraction of sub-arrays


I have one-hot encoded data of arbitrary shape stored in an array with ndim = 3, e.g.:

import numpy as np

arr = np.array([ # Axis 0
    [ # Axis 1
        [0, 1, 0], # Axis 2
        [1, 0, 0],
    ],
    [
        [0, 0, 1],
        [0, 1, 0],
    ],
])

What I want is to shuffle values for a known fraction of sub-arrays along axis=2.

If this fraction is 0.25, then the result could be:

arr = np.array([
    [
        [1, 0, 0], # Shuffling happened here
        [1, 0, 0],
    ],
    [
        [0, 0, 1],
        [0, 1, 0],
    ],
])

I know how to do that using iterative methods like:

for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        if np.random.choice([0, 1, 2, 3]) == 0:
            np.random.shuffle(arr[i][j])

But looping over every sub-array in Python like this is extremely inefficient.

Edit: as suggested in the comments, the random selection of a known fraction should follow a uniform law.
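
To make the distinction concrete: the loop above picks each sub-array independently with probability 1/4, so the number of shuffled sub-arrays varies between runs, whereas a known fraction implies an exact count drawn uniformly without replacement. The following is only a minimal sketch of that difference; `n_rows`, the seed, and the variable names are illustrative assumptions, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

n_rows = 1000      # hypothetical number of sub-arrays (shape[0] * shape[1])
fraction = 0.25

# Independent draws, as in the loop above: each sub-array is picked with
# probability `fraction`, so the number picked varies from run to run.
bernoulli_mask = rng.random(n_rows) < fraction
n_bernoulli = int(bernoulli_mask.sum())

# Exact fraction: draw int(n_rows * fraction) distinct positions uniformly,
# without replacement, so the count is always the same.
exact_idx = rng.choice(n_rows, size=int(n_rows * fraction), replace=False)

print(n_bernoulli)      # near 250, but not guaranteed to equal it
print(exact_idx.size)   # always 250
```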


Solution

  • One approach:

    import numpy as np
    
    np.random.seed(42)
    
    fraction = 0.25
    total = arr.shape[0] * arr.shape[1]
    
    # pick the sub-arrays to shuffle: an exact count, uniformly, without replacement
    indices = np.random.choice(np.arange(total), size=int(total * fraction), replace=False)
    
    # convert each flat index to the corresponding (axis 0, axis 1) multi-index
    multi_indices = np.unravel_index(indices, arr.shape[:2])
    
    # gather the selected sub-arrays (advanced indexing returns a copy, not a view)
    selected = arr[multi_indices]
    
    # shuffle each selected sub-array independently by argsorting random values of the same shape
    shuffled = np.take_along_axis(selected, np.argsort(np.random.random(selected.shape), axis=1), axis=1)
    
    # write the shuffled sub-arrays back into the original array
    arr[multi_indices] = shuffled
    print(arr)
    

    Output (of a single run)

    [[[0 1 0]
      [0 0 1]]
    
     [[0 0 1]
      [0 1 0]]]
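
As a possible variant, the same selection step can be paired with `Generator.permuted` (available in NumPy 1.20 and later), which shuffles each row along an axis independently and so replaces the argsort trick. This is a sketch under assumptions rather than a drop-in replacement: the function name `shuffle_fraction` and its `rng` parameter are hypothetical, and it returns a modified copy instead of editing `arr` in place.

```python
import numpy as np


def shuffle_fraction(arr, fraction, rng=None):
    """Return a copy of `arr` with a fraction of its axis-2 sub-arrays shuffled."""
    rng = np.random.default_rng() if rng is None else rng
    out = arr.copy()
    n_rows = out.shape[0] * out.shape[1]
    # Choose exactly int(n_rows * fraction) distinct sub-arrays, uniformly.
    flat_idx = rng.choice(n_rows, size=int(n_rows * fraction), replace=False)
    rows, cols = np.unravel_index(flat_idx, out.shape[:2])
    # Advanced indexing yields a copy of shape (k, out.shape[2]);
    # permuted(..., axis=1) shuffles each of those k rows independently.
    out[rows, cols] = rng.permuted(out[rows, cols], axis=1)
    return out


arr = np.array([
    [[0, 1, 0], [1, 0, 0]],
    [[0, 0, 1], [0, 1, 0]],
])
result = shuffle_fraction(arr, 0.25, rng=np.random.default_rng(42))
print(result)
```

Note that a shuffled one-hot sub-array of length n comes back unchanged with probability 1/n, since the single 1 lands on a uniformly random position. The number of sub-arrays that visibly change can therefore be smaller than the requested fraction.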