Tags: python, python-3.x, numpy, numpy-memmap

How to gradually write large amounts of data to memory?


I am performing a signal processing task on a large dataset of images, converting the images into large feature vectors with a certain structure (number_of_transforms, width, height, depth).

The feature vectors (or coefficients in my code) are too large to keep in memory all at once, so I tried writing them to an np.memmap, like this:

import numpy as np

# Pre-allocate a single disk-backed array large enough for every sample.
coefficients = np.memmap(
    output_location, dtype=np.float32, mode="w+",
    shape=(n_samples, number_of_transforms, width, height, depth))

for n in range(n_samples):
    # Transform one image at a time and write its coefficients straight to disk.
    coefficients_sample = transform(images[n])
    coefficients[n, :, :, :, :] = coefficients_sample

This works for my purpose, but it has a downside: if I want to load the coefficients of a particular "run" at a later time for analysis (the transform has to be tested with different hyperparameters), I have to somehow reconstruct the original shape (number_of_transforms, width, height, depth), which is bound to get messy.
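Concretely, reopening that file later with plain np.memmap would look something like the sketch below; the file stores only raw bytes, so the dtype and the full shape have to be supplied again by hand (the dimension names are just the ones from above):

# Hypothetical reload: np.memmap keeps no metadata, so shape and dtype
# must be tracked separately and passed in again.
coefficients = np.memmap(
    output_location, dtype=np.float32, mode="r",
    shape=(n_samples, number_of_transforms, width, height, depth))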

Is there a cleaner (preferably numpy-compatible) way that lets me retain the structure and data type of my transform feature vectors while still intermittently writing the results of transform to disk?


Solution

  • As @juanpa.arrivillaga pointed out, the only change that needs to be made is using numpy.lib.format.open_memmap instead of np.memmap. Unlike a raw np.memmap, open_memmap writes a standard .npy file whose header records the array's dtype and shape:

    coefficients = np.lib.format.open_memmap(
        output_location, dtype=np.float32, mode="w+",
        shape=(n_samples, number_of_transforms, width, height, depth))
    

    And at a later time, retrieve the data (with correct shape and data type) like so:

    coefficients = np.lib.format.open_memmap(output_location)
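
    For a quick sanity check (using the same hypothetical dimension names as above), the reopened array comes back with its full structure:

    # Shape and dtype are recovered from the .npy header, no manual bookkeeping needed.
    print(coefficients.shape)   # (n_samples, number_of_transforms, width, height, depth)
    print(coefficients.dtype)   # float32

    Because the result is a regular .npy file, np.load(output_location, mmap_mode="r") also opens it as a memory map with the same shape and dtype.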