python · hdf5 · h5py

How to optimize sequential writes with h5py to increase speed when reading the file afterwards?


I process some input data which, if I did it all at once, would give me a dataset of float32s and typical shape (5000, 30000000). (The length of the 0th axis is fixed, the 1st varies, but I do know what it will be before I start).

Since that's ~600GB and won't fit in memory I have to cut it up along the 1st axis and process it in blocks of (5000, blocksize). I cannot cut it up along the 0th axis, and due to RAM constraints blocksize is typically around 40000. At the moment I'm writing each block to an hdf5 dataset sequentially, creating the dataset like:

import h5py
import numpy as np

fout = h5py.File(fname, "a")

blocksize = 40000

block_to_write = np.random.random((5000, blocksize))
fout.create_dataset("data", data=block_to_write, maxshape=(5000, None))

and then looping through blocks and adding to it via

fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)
fout["data"][:, -blocksize:] = block_to_write

This works and runs in an acceptable amount of time.

The end product I need to feed into the next step is a binary file for each row of the output. It's someone else's software so unfortunately I have no flexibility there.

The problem is that reading in one row like

fin = h5py.File(fname, 'r')
data = fin['data']
a = data[0,:]

takes ~4min and with 5000 rows, that's way too long!

Is there any way I can alter my write so that my read is faster? Or is there anything else I can do instead?

Should I make each individual row its own data set within the hdf5 file? I assumed that doing lots of individual writes would be too slow but maybe it's better?

I tried writing the binary files directly - opening them outside of the loop, writing to them during the loop, and then closing them afterwards - but I ran into OSError: [Errno 24] Too many open files. I haven't tried it, but I assume opening and closing the files inside the loop would make it way too slow.


Solution

  • Your question is similar to a previous SO/h5py question I recently answered: h5py extremely slow writing. Apparently you are getting acceptable write performance, and want to improve read performance.

    The 2 most important factors that affect h5py I/O performance are: 1) chunk size/shape, and 2) size of the I/O data block. h5py docs recommend keeping chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. I have also found write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. The size of your read data block is certainly large enough.

    So, my initial hunch was to investigate chunk size influence on I/O performance. Setting the optimal chunk size is a bit of an art. Best way to tune the value is to enable chunking, let h5py define the default size, and see if you get acceptable performance. You didn't define the chunks parameter. However, because you defined the maxshape parameter, chunking was automatically enabled with a default size (based on the dataset's initial size). (Without chunking, I/O on a file of this size would be painfully slow.) An additional consideration for your problem: the optimal chunk size has to balance the size of the write data blocks (5000 x 40_000) vs the read data blocks (1 x 30_000_000).
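
    If you do want to control the layout yourself, the chunks parameter can be passed to create_dataset alongside maxshape. A minimal sketch (the filename is illustrative; the chunk shape is the user-defined one from test 3 below, sized to favor whole-row reads):

    import h5py

    n_rows, blocksize = 5_000, 40_000

    with h5py.File("example.h5", "w") as fout:   # illustrative filename
        # each chunk holds 10 x 40_000 float32 values (~1.5 MiB), so reading one
        # full row touches far fewer chunks than the h5py default layout
        fout.create_dataset("data", shape=(n_rows, blocksize), dtype="f4",
                            maxshape=(n_rows, None), chunks=(10, blocksize))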

    I parameterized your code so I could tinker with the dimensions. When I did, I discovered something interesting. Reading the data is much faster when I run it as a separate process after creating the file. And, the default chunk size seems to give adequate read performance. (Initially I was going to benchmark different chunk size values.)

    Note: I only created a 78GB file (4_000_000 columns). This takes >13 minutes to run on my Windows system, and I didn't want to wait 90 minutes to create a 600GB file. You can set n_blocks=750 if you want to test 30_000_000 columns. :-) All code is at the end of this post.

    Next I created a separate program to read the data. Read performance was fast with the default chunk size: (40, 625). Timing output below:

    Time to read first row: 0.28 (in sec)
    Time to read last row:  0.28
    

    Interestingly, I did not get the same read times with every test. Values above were pretty consistent, but occasionally I would get a read time of 7-10 seconds. Not sure why that happens.

    I ran 3 tests (in all cases block_to_write.shape=(5_000, 40_000)):

    1. default chunksize=(40,625) [95KB]; for 5_000x40_000 dataset (resized)
    2. default chunksize=(10,15625) [596KB]; for 5_000x4_000_000 dataset (not resized)
    3. user defined chunksize=(10,40_000) [1.526MB]; for 5_000x4_000_000 dataset (not resized)

    Larger chunks improve read performance, but speed with the default values is pretty fast. (Chunk size has a very small effect on write performance.) Output for all 3 below.

    dataset chunkshape: (40, 625)
    Time to read first row: 0.28
    Time to read last row: 0.28
    
    dataset chunkshape: (10, 15625)
    Time to read first row: 0.05
    Time to read last row: 0.06
    
    dataset chunkshape: (10, 40000)
    Time to read first row: 0.00
    Time to read last row: 0.02
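
    For reference, the bracketed chunk sizes above follow directly from the chunk shape times 4 bytes per float32 element. A quick sketch to check the arithmetic:

    import numpy as np

    itemsize = np.dtype("float32").itemsize          # 4 bytes; h5py's default dtype
    for chunks in [(40, 625), (10, 15_625), (10, 40_000)]:
        chunk_mib = chunks[0] * chunks[1] * itemsize / 2**20
        print(chunks, f'{chunk_mib:.3f} MiB')        # 0.095, 0.596, 1.526 MiB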
    

    Code to create my test file below:

    import time
    import h5py
    import numpy as np

    fname = "test.h5"   # example filename

    with h5py.File(fname, 'w') as fout:
        blocksize = 40_000
        n_blocks = 100
        n_rows = 5_000
        block_to_write = np.random.random((n_rows, blocksize))
        start = time.time()
        for cnt in range(n_blocks):
            incr = time.time()
            print(f'Working on loop: {cnt}', end='')
            if "data" not in fout:
                # no dtype given, so h5py defaults to float32 ('f4')
                fout.create_dataset("data", shape=(n_rows, blocksize),
                            maxshape=(n_rows, None))  # , chunks=(10, blocksize)
            else:
                fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)

            fout["data"][:, cnt*blocksize:(cnt+1)*blocksize] = block_to_write
            print(f' - Time to add block: {time.time()-incr:.2f}')
    print(f'Done creating file: {fname}')
    print(f'Time to create {n_blocks}x{blocksize:,} columns: {time.time()-start:.2f}\n')
    

    Code to read 2 different arrays from the test file below:

    import time
    import h5py

    fname = "test.h5"   # same file created above

    with h5py.File(fname, 'r') as fin:
        print(f'dataset shape: {fin["data"].shape}')
        print(f'dataset chunkshape: {fin["data"].chunks}')
        start = time.time()
        data = fin["data"][0,:]
        print(f'Time to read first row: {time.time()-start:.2f}')
        start = time.time()
        data = fin["data"][-1,:]
        print(f'Time to read last row: {time.time()-start:.2f}')
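
    Finally, regarding the binary file per row you need to hand off: once the HDF5 file exists you can export the rows one at a time, so only one output file is ever open and you avoid the Errno 24 problem. A minimal sketch, assuming the downstream software accepts the raw bytes that numpy's tofile() writes (filenames are illustrative):

    import h5py

    fname = "test.h5"                      # file created above
    with h5py.File(fname, 'r') as fin:
        dset = fin["data"]
        for i in range(dset.shape[0]):
            row = dset[i, :]               # read one full row into memory
            row.tofile(f'row_{i:04d}.bin') # raw float32 bytes, no header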