python numpy hdf5 h5py

H5Py and storage


I am writing some code which needs to save a very large numpy array to memory. The numpy array is so large in fact that I cannot load it all into memory at once. But I can calculate the array in chunks. I.e. my code looks something like:

for i in np.arange(numberOfChunks):

   myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = #... do some calculation

As I can't load myArray into memory all at once, I want to save it to a file one "chunk" at a time. i.e. I want to do something like this:

for i in np.arange(numberOfChunks):

   myArrayChunk = #... do some calculation to obtain chunk

   saveToFile(myArrayChunk, indicesInFile=[(i*chunkSize):(i*(chunkSize+1)),:,:], filename)

I understand this can be done with h5py but I am a little confused how to do this. My current understanding is that I can do this:

import h5py

# Make the file
h5py_file = h5py.File(filename, "a")

# Tell it we are going to store a dataset
myArray = h5py_file.create_dataset("myArray", myArrayDimensions, compression="gzip")


for i in np.arange(numberOfChunks):

   myArrayChunk = #... do some calculation to obtain chunk

   myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk

But this is where I become a little confused. I have read that if you index a h5py datatype like I did when I wrote myArray[(i*chunkSize):(i*(chunkSize+1)),:,:], then this part of myArray has now been read into memory. So surely, by the end of my loop above, have I not still got the whole of myArray in memory now? How has this saved my memory?

Similarly, later on, I would like to read my file back in one chunk at a time and do further calculations, i.e. I would like to do something like:

import h5py

# Read in the file
h5py_file = h5py.File(filename, "a")

# Read in myArray
myArray = h5py_file['myArray']

for i in np.arange(numberOfChunks):

   # Read in chunk
   myArrayChunk = myArray[(i*chunkSize):(i*(chunkSize+1)),:,:]

   # ... Do some calculation on myArrayChunk

But by the end of this loop is the whole of myArray now in memory? I am a little confused by when myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] is in memory and when it isn't. Please could someone explain this.


Solution

  • You have the basic idea. Be careful with the phrase "save to memory": NumPy arrays live in memory (RAM), whereas HDF5 data is stored on disk (not in memory/RAM!) and is read into memory only when you access it, so the memory used depends on how you access it. In your first step you create the dataset and write it to disk in chunks. In your second step you read the data from disk in chunks. A working example is provided at the end.

    When reading data with h5py there are two ways to access the data:
    This reads the whole dataset into memory and returns a NumPy array:
    myArrayNP = myArray[:,:,:]
    This returns an h5py dataset object that behaves like a NumPy array:
    myArrayDS = myArray

    The difference: h5py dataset objects are not read into memory all at once. You can slice them as needed, and only the slice you request is read from disk and returned as a NumPy array. Continuing from above, this is a valid operation to get a subset of the data:
    myArrayChunkNP = myArrayDS[(i*chunkSize):((i+1)*chunkSize),:,:]

    My example also corrects one small error in the upper bound of your chunk slice. You had:
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
    You want:
    myArray[(i*chunkSize):((i+1)*chunkSize),:,:] = myArrayChunk
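To see why the upper bound matters, here is a small sketch in plain Python. It assumes chunkSize = 4 and numberOfChunks = 3, and the helper name slice_lengths is made up for illustration. It only counts how many rows each version of the slice would cover.

```python
chunkSize = 4
numberOfChunks = 3

def slice_lengths(upper_bound):
    """Number of rows covered by [i*chunkSize : upper_bound(i)] for each i."""
    return [len(range(i * chunkSize, upper_bound(i))) for i in range(numberOfChunks)]

# Original bound i*(chunkSize+1): chunk i covers only i rows
wrong = slice_lengths(lambda i: i * (chunkSize + 1))

# Corrected bound (i+1)*chunkSize: every chunk covers exactly chunkSize rows
right = slice_lengths(lambda i: (i + 1) * chunkSize)

print(wrong)
print(right)
```

With the original bound the slice for chunk i is only i rows long, so it does not match the shape of the chunk being assigned.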

    Working Example (writes and reads):

    import h5py
    import numpy as np
    
    # Make the file
    with h5py.File("SO_61173314.h5", "w") as h5w:
    
        numberOfChunks = 3
        chunkSize = 4
        print( 'WRITING %d chunks w/ chunkSize=%d ' % (numberOfChunks,chunkSize) )
        # Write dataset to disk
        h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize,2,2), compression="gzip")
    
        for i in range(numberOfChunks):
    
           h5ArrayChunk = np.random.random(chunkSize*2*2).reshape(chunkSize,2,2)
           print (h5ArrayChunk)
    
           h5Array[(i*chunkSize):((i+1)*chunkSize),:,:] = h5ArrayChunk
    
    
    with h5py.File("SO_61173314.h5", "r") as h5r:
        print( '\nREADING %d chunks w/ chunkSize=%d\n' % (numberOfChunks,chunkSize) )
    
        # Access myArray dataset - Note: This is NOT a NumPy array
        myArray = h5r['myArray']
    
        for i in range(numberOfChunks):
    
           # Read a chunk into memory (as a NumPy array)
           myArrayChunk = myArray[(i*chunkSize):((i+1)*chunkSize),:,:]
    
           # ... Do some calculation on myArrayChunk  
           print (myArrayChunk)
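
To address the "is the whole array in memory at the end of the loop?" question directly, here is a minimal sketch that checks the types involved. It assumes h5py and NumPy are installed. The file name demo_chunks.h5, the temporary directory, and the running sum are illustrative choices, not part of the example above. Each pass through the read loop rebinds myArrayChunk to a new NumPy array holding one chunk, so the previous chunk is no longer referenced and can be freed.

```python
import os
import tempfile

import h5py
import numpy as np

numberOfChunks = 3
chunkSize = 4

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo_chunks.h5")  # hypothetical file name

    # Write the dataset to disk one chunk at a time
    with h5py.File(path, "w") as h5w:
        dset = h5w.create_dataset("myArray", (numberOfChunks * chunkSize, 2, 2), dtype="f8")
        for i in range(numberOfChunks):
            # Chunk i is filled with the value i so the result is easy to check
            dset[i * chunkSize:(i + 1) * chunkSize, :, :] = np.full((chunkSize, 2, 2), float(i))

    # Read it back one chunk at a time
    with h5py.File(path, "r") as h5r:
        myArray = h5r["myArray"]
        is_dataset = isinstance(myArray, h5py.Dataset)  # a handle to data on disk
        is_ndarray = isinstance(myArray, np.ndarray)    # not an in-memory array

        total = 0.0
        chunk_shapes = []
        for i in range(numberOfChunks):
            # Slicing reads only this chunk from disk into a NumPy array
            myArrayChunk = myArray[i * chunkSize:(i + 1) * chunkSize, :, :]
            chunk_shapes.append(myArrayChunk.shape)
            total += myArrayChunk.sum()

print(is_dataset, is_ndarray)
print(chunk_shapes)
print(total)
```

Only the running total and the current chunk are kept between iterations, which is the pattern to follow when the full array does not fit in RAM.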