I am performing a signal processing task on a large dataset of images, converting each image into a large feature vector with a certain structure (number_of_transforms, width, height, depth).

The feature vectors (called coefficients in my code) are too large to keep in memory all at once, so I tried writing them into an np.memmap, like this:
import numpy as np

# Preallocate one disk-backed array holding the coefficients of every sample.
coefficients = np.memmap(
    output_location, dtype=np.float32, mode="w+",
    shape=(n_samples, number_of_transforms, width, height, depth))

for n in range(n_samples):
    image = images[n]
    coefficients_sample = transform(image)  # (number_of_transforms, width, height, depth)
    coefficients[n, :, :, :, :] = coefficients_sample
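One caveat: writes to a memmap are buffered by the operating system, so it is worth flushing explicitly once the loop finishes. A minimal sketch, continuing from the loop above:

# Push any pending writes out to disk; per the numpy docs, deleting
# the memmap object also flushes before releasing the file handle.
coefficients.flush()
del coefficients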
This works for my purpose, but with one downside: if I want to load the coefficients of a certain "run" (the transform has to be tested with different hyperparameters) at a later time for analysis, I have to somehow reconstruct the original shape (number_of_transforms, width, height, depth), which is bound to get messy.

Is there a cleaner (preferably numpy-compatible) way that lets me retain the structure and data type of my transform feature vectors, while still intermittently writing the results of transform to disk?
As @juanpa.arrivillaga pointed out, the only change needed is to use numpy.lib.format.open_memmap instead of np.memmap. Unlike a raw memmap, open_memmap writes a standard .npy header recording the array's shape and dtype, so neither has to be supplied again when the file is reopened; the write loop itself stays the same:
coefficients = numpy.lib.format.open_memmap(
    output_location, dtype=np.float32, mode="w+",
    shape=(n_samples, number_of_transforms, width, height, depth))
And at a later time, retrieve the data (with correct shape and data type) like so:
coefficients = numpy.lib.format.open_memmap(output_location)
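If a later analysis only needs to read the coefficients, reopening the file read-only avoids accidentally overwriting a run's results. A minimal sketch, assuming output_location is the path written above:

import numpy as np

# mode="r" maps the file read-only; shape and dtype are
# recovered from the .npy header automatically.
coefficients = np.lib.format.open_memmap(output_location, mode="r")
print(coefficients.shape)  # (n_samples, number_of_transforms, width, height, depth)
print(coefficients.dtype)  # float32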