python, numpy, google-colaboratory

How to use numpy file without importing into RAM?


I want to use a NumPy file (.npy) from Google Drive in Google Colab without loading it into RAM.

I am working on image classification and have my image data in four NumPy files on Google Drive. Together the files are larger than 14 GB, whereas Google Colab offers only 12 GB of RAM. Is there a way to load only a single batch at a time into RAM to train the model, and then remove it from RAM (maybe similar to flow_from_directory)?

The problem with flow_from_directory is that it is very slow, even for one block of VGG16 and even when the images are in the Colab directory.

I am using the Cats vs Dogs Classifier dataset from Kaggle.

! kaggle competitions download -c 'dogs-vs-cats'

I converted the image data into NumPy arrays and saved them in 4 files:

X_train - float32 - 10.62GB - (18941, 224, 224, 3)

X_test - float32 - 3.4GB - (6059, 224, 224, 3)

Y_train - float64 - 148KB - (18941)

Y_test - float64 - 47KB - (6059)

When I run the following code, the session crashes with the error 'Your session crashed after using all available RAM.'

import numpy as np
X_train = np.load('Cat_Dog_Classifier/X_train.npy')
Y_train = np.load('Cat_Dog_Classifier/Y_train.npy')
X_test = np.load('Cat_Dog_Classifier/X_test.npy')
Y_test = np.load('Cat_Dog_Classifier/Y_test.npy')

Is there any way to use these 4 files without loading them into RAM?


Solution

  • You can do this by opening your file as a memory-mapped array.

    For example:

    import sys
    import numpy as np
    
    # Create a npy file
    x = np.random.rand(1000, 1000)
    np.save('mydata.npy', x)
    
    # Load as a normal array
    y = np.load('mydata.npy')
    sys.getsizeof(y)
    # 8000112
    
    # Load as a memory-mapped array
    y = np.load('mydata.npy', mmap_mode='r')
    sys.getsizeof(y)
    # 136
    

    The second array behaves like a normal array but is backed by disk rather than RAM. Be aware that operations on it will be much slower than on a normal RAM-backed array; memory mapping is most often used to conveniently access portions of an array without having to load the full array into RAM.
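To connect this back to the question of loading one batch at a time, here is a minimal sketch of a batch generator built on memory-mapped arrays. The function name iter_batches, the demo file names, and the tiny stand-in arrays are hypothetical and are not from the original post; the real X_train.npy and Y_train.npy paths would take their place. It assumes the samples are read in file order, so shuffling would need extra index handling.

```python
import os
import tempfile

import numpy as np


def iter_batches(x_path, y_path, batch_size=32):
    """Yield (X, Y) batches, copying only one batch into RAM at a time."""
    # mmap_mode='r' opens the files read-only without reading their contents.
    X = np.load(x_path, mmap_mode='r')
    Y = np.load(y_path, mmap_mode='r')
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        # np.array(...) copies just this slice from disk into a regular array.
        yield np.array(X[start:stop]), np.array(Y[start:stop])


# Tiny stand-in data (hypothetical), written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, 'X_demo.npy')
y_path = os.path.join(tmp_dir, 'Y_demo.npy')
np.save(x_path, np.random.rand(10, 4, 4, 3).astype(np.float32))
np.save(y_path, np.arange(10, dtype=np.float64))

batch_shapes = [xb.shape for xb, yb in iter_batches(x_path, y_path, batch_size=4)]
print(batch_shapes)
# [(4, 4, 4, 3), (4, 4, 4, 3), (2, 4, 4, 3)]
```

Each yielded batch is an ordinary in-memory array, so it could be passed to a training step and released once the next batch is requested. Reading contiguous slices like this is usually the friendliest access pattern for a disk-backed array.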