python, numpy, memory, deep-learning, large-data

I want to read a large number of images for deep learning, but what is the solution when there is not enough memory?


In a deep learning program written in Python, I wanted to store a large amount of image data in a NumPy array at once and then extract random batches from that array, but the image data is too large and I run out of memory. How should such a case be handled? Do I have no choice but to do I/O and read the image data from storage every time I retrieve a batch?
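For reference, the per-batch approach mentioned above could look roughly like the sketch below: only the file paths are kept in memory, and a random batch of images is decoded from disk on demand. The directory path, image size, and use of Pillow are assumptions for illustration, not part of the original question.

```python
# Minimal sketch: keep only file paths in RAM and read a random batch per step.
import glob
import random

import numpy as np
from PIL import Image

paths = glob.glob("data/train/*.png")  # hypothetical image directory


def load_random_batch(paths, batch_size=32, size=(224, 224)):
    """Read a random batch of images from disk instead of holding the whole dataset in memory."""
    chosen = random.sample(paths, batch_size)
    images = [
        np.asarray(Image.open(p).convert("RGB").resize(size), dtype=np.float32) / 255.0
        for p in chosen
    ]
    return np.stack(images)  # shape: (batch_size, H, W, 3)


batch = load_random_batch(paths)
```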


Solution

  • File I/O would solve the issue, but it will slow down the learning process, since file I/O is a slow operation.

    However, you could try to implement a mixture of both using multithreading, e.g.

    https://github.com/stratospark/keras-multiprocess-image-data-generator

    (I do not know which framework you are using.)

    Anyhow, back to the basic idea:

    Pick some random files, read them, and start training. During training, start a second thread that reads more random files. That way your training thread does not have to wait for new data, since a training step will usually take longer than the reading process (see the sketch after the links below).

    Some frameworks already have this feature implemented; check out:

    https://github.com/fchollet/keras/issues/1627

    or:

    https://github.com/pytorch/examples/blob/master/mnist_hogwild/train.py
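Below is a minimal sketch of the background-loading idea described in the answer, using only the standard library: a producer thread keeps filling a bounded queue with freshly loaded batches while the main thread trains. It reuses the hypothetical `load_random_batch` helper and `paths` list from the sketch after the question; `train_step` is a placeholder for your own training code.

```python
# Minimal producer/consumer sketch: load batches in a background thread while training.
import queue
import threading


def producer(paths, batch_queue, stop_event, batch_size=32):
    """Keep the queue filled with freshly loaded batches until training stops."""
    while not stop_event.is_set():
        batch = load_random_batch(paths, batch_size)
        batch_queue.put(batch)  # blocks when the queue is full, which bounds memory use


batch_queue = queue.Queue(maxsize=8)  # holds a few batches, never the whole dataset
stop_event = threading.Event()
loader = threading.Thread(
    target=producer, args=(paths, batch_queue, stop_event), daemon=True
)
loader.start()

for step in range(10000):
    batch = batch_queue.get()  # usually ready immediately if training is the bottleneck
    # train_step(batch)        # placeholder for one optimization step
    batch_queue.task_done()

stop_event.set()  # the daemon thread exits with the process even if still blocked on put()
```

The frameworks linked above provide built-in versions of the same pattern, for example Keras's `fit_generator` and PyTorch's `DataLoader` with `num_workers > 0`, so in practice you would usually reach for those rather than hand-rolling the threading.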