
How to select samples from a multi-dimensional array using a list of indices?


I need to evaluate my model's performance with limited training data. I am randomly selecting a fraction p of the original training data. Assume p is 0.2 in this case. Here are some initial lines of code:

# number of samples to draw; must be an int for range() below
data_samples = int(data.shape[0] * p)  # data.shape = (100, 50, 50, 3)

# for randomly selecting data
import random
random.seed(1234)
filter_indices=[random.randrange(0, data.shape[0]) for _ in range(data_samples)]

This gives me a list of random indices ranging between 0 and the total number of samples.

Now I want to extract the samples of 'data' at the positions in filter_indices, keeping all of the remaining dimensions. How can I do that effectively and efficiently?


Solution

  • You can use NumPy's integer array indexing and pass your generated list of indices directly as the index. When the index array is used on its own (that is, only along the first axis), the trailing dimensions are carried over to the result automatically. A smaller example:

    import numpy as np
    
    # Your data goes here
    data = np.arange(90).reshape(10, 3, 3)
    
    N = data.shape[0]
    p = 0.2
    
    # Generating random indices
    n_samples = int(N * p)
    np.random.seed(0)
    filter_indices = np.random.choice(N, size=n_samples)
    
    # Indexing magic:
    out = data[filter_indices]
    

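    Note above that I've used NumPy's built-in random module to streamline your code a little bit via np.random.choice. By default it samples with replacement, just like the random.randrange loop in the question, so an index can appear more than once.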

    Results:

    >>> filter_indices
    array([5, 0])
    >>> out
    array([[[45, 46, 47],
            [48, 49, 50],
            [51, 52, 53]],
    
           [[ 0,  1,  2],
            [ 3,  4,  5],
            [ 6,  7,  8]]])
    >>> out.shape
    (2, 3, 3)
    

    out consists of exactly the two (3, 3) subarrays of data at indices 5 and 0, so the result has shape (2, 3, 3) instead of (10, 3, 3).
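
If the goal is a true 20% subset with no repeated samples, one option is to draw the indices without replacement. The following is only a sketch under assumptions not stated in the question: the zero-filled data array is a placeholder with the shape mentioned there, and the use of np.random.default_rng with replace=False is a suggested variation on the approach above, not part of the original answer.

```python
import numpy as np

# Placeholder standing in for the real training set (shape taken from the question)
data = np.zeros((100, 50, 50, 3), dtype=np.float32)

p = 0.2
n_samples = int(data.shape[0] * p)

# Seeded generator so the same subset is drawn on every run
rng = np.random.default_rng(1234)

# replace=False means each index can be drawn at most once
filter_indices = rng.choice(data.shape[0], size=n_samples, replace=False)

# Integer array indexing along the first axis keeps the trailing dimensions
subset = data[filter_indices]

print(subset.shape)                    # (20, 50, 50, 3)
print(len(np.unique(filter_indices)))  # 20, i.e. no duplicates
```

The same filter_indices array can be reused to index any other array that is aligned with data along the first axis, so the selections stay in sync.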