Search code examples
pythonnumpynumpy-ndarraynumpy-memmap

Shuffling and importing few rows of a saved numpy file


I have 2 saved .npy files:

X_train - (18873, 224, 224, 3) - 21.2GB
Y_train - (18873,) - 148KB

X_train is cats and dogs images (cats being in 1st half and dogs in 2nd half, unshuffled) and is mapped with Y_train as 0 and 1. Thus Y_train is [1,1,1,1,1,1,.........,0,0,0,0,0,0].

I want to import randomly say, 256 images (both cats and dogs images in nearly 50-50%) in X and its mapping in Y. Since the data is large, I cannot import X_train in my RAM.

Thus I have tried (1st approach):

import numpy as np
np.random.seed(666555)
X_train = np.load('Processed/X_train.npy', mmap_mode='r')
X = np.random.shuffle(X_train)
X = X[:256, :, :, :]
Y_train = np.load('Processed/Y_train.npy', mmap_mode='r')
Y = np.random.shuffle(Y_train)
Y = Y[:256]

This gives the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-8b2a13921b8d> in <module>
      2 np.random.seed(666555)
      3 X_train = np.load('Processed/X_train.npy', mmap_mode='r')
----> 4 X = np.random.shuffle(X_train)
      5 X = X[:256, :, :, :]
      6 Y_train = np.load('Processed/Y_train.npy', mmap_mode='r')

mtrand.pyx in numpy.random.mtrand.RandomState.shuffle()

mtrand.pyx in numpy.random.mtrand.RandomState.shuffle()

ValueError: assignment destination is read-only

I have also tried (2nd approach):

import numpy as np
np.random.seed(666555)
X = np.memmap('Processed/X_train.npy', 'float64', shape = (256, 224, 224, 3), mode = 'c')
Y = np.memmap('Processed/Y_train.npy', 'float64', shape = (256), mode = 'c')
X = np.random.shuffle(X)
Y = np.random.shuffle(Y)
print(X)
print(Y)

This outputs:

None
None

In 2nd approach, I will get only cats images as np.memmap will collect only 1st 256 images. Then shuffling will be of no use.

Please tell me how to do this with any method.


Solution

  • your shuffelling procedure is not correct. following this strategy you are also shuffling your X in a different way from Y (there is no more match between X and Y after shuffle). here a demonstrative example:

    np.random.seed(666555)
    xxx = np.asarray([1,2,3,4,5,6,7,8,9])
    yyy = np.asarray([1,2,3,4,5,6,7,8,9])
    np.random.shuffle(xxx)
    np.random.shuffle(yyy)
    
    print((yyy == xxx).all()) # False
    

    here the correct procedure:

    np.random.seed(666555)
    xxx = np.asarray([1,2,3,4,5,6,7,8,9])
    yyy = np.asarray([1,2,3,4,5,6,7,8,9])
    idx = np.arange(0,len(xxx))
    np.random.shuffle(idx)
    
    print((yyy[idx] == xxx[idx]).all()) # True
    

    in this way you also override the None problem