Tags: python, cython, hdf5, hdf

Why does parallel reading of an HDF5 Dataset max out at 100% CPU, but only for large Datasets?


I'm using Cython to read a single Dataset from an HDF5 file using 64 threads. Each thread calculates a start index start and a chunk size size, and reads its chunk into a common buffer buf, which is a memoryview of a NumPy array. Crucially, each thread opens its own handle to the file and Dataset. Here's the code:

# prange is cython.parallel.prange; hid_t, hsize_t, and the H5* API come from
# a `cdef extern from "hdf5.h"` block omitted here for brevity
from cython.parallel import prange

def read_hdf5_dataset(const char* file_name, const char* dataset_name,
                      long[::1] buf, int num_threads):
    cdef hsize_t base_size = buf.shape[0] // num_threads
    cdef hsize_t start, size
    cdef hid_t file_id, dataset_id, mem_space_id, file_space_id
    cdef int thread
    for thread in prange(num_threads, nogil=True):
        start = base_size * thread
        size = base_size + buf.shape[0] % num_threads \
            if thread == num_threads - 1 else base_size
        file_id = H5Fopen(file_name, H5F_ACC_RDONLY, H5P_DEFAULT)
        dataset_id = H5Dopen2(file_id, dataset_name, H5P_DEFAULT)  # last arg is a dataset-access plist
        mem_space_id = H5Screate_simple(1, &size, NULL)
        file_space_id = H5Dget_space(dataset_id)
        H5Sselect_hyperslab(file_space_id, H5S_SELECT_SET, &start,
                            NULL, &size, NULL)
        H5Dread(dataset_id, H5Dget_type(dataset_id), mem_space_id,
                file_space_id, H5P_DEFAULT, <void*> &buf[start])
        H5Sclose(file_space_id)
        H5Sclose(mem_space_id)
        H5Dclose(dataset_id)
        H5Fclose(file_id)

Although it reads the Dataset correctly, total CPU utilization maxes out at exactly 100% (one core's worth) on a float32 Dataset of ~10 billion entries, even though the same code uses all 64 CPUs (albeit only at ~20-30% utilization each, presumably due to the I/O bottleneck) on a float32 Dataset of ~100 million entries. I've tried this on two different computing clusters with the same result. Maybe it has something to do with the size of the Dataset being greater than INT32_MAX?

What's stopping this code from running in parallel on extremely large datasets, and how can I fix it? Any other suggestions to improve the code's clarity or efficiency would also be appreciated.


Solution

  • Something is happening that is either preventing Cython's prange from launching multiple threads, or preventing the threads from getting anywhere once launched. It may or may not have anything to do with hdf5 directly. Here are some possible causes:

    • Are you pre-allocating a buf large enough to hold the entire dataset before running your function? If so, your program is allocating 40+ gigabytes of memory (4 bytes per float32). How much memory do the nodes you're running on have? Are you the only user? Memory starvation could easily cause the kind of performance issues you describe. (A quick way to check is sketched at the end of this answer.)

    • Both Cython and hdf5 need to be built with parallelism in mind: prange requires compiling and linking with OpenMP, and hdf5 must be a thread-safe build if you call it from multiple threads. Between your small and large dataset runs, did you modify or recompile your code at all? (A minimal OpenMP-enabled setup.py is sketched at the end of this answer.)

    • One easy way to explain why your program is using 100% of a single cpu is that it's getting hung somewhere before your read_hdf5_dataset function is ever called. What other code in your program runs first, and could it be causing the problems you see?

    Part of the problem here is that it is going to be very hard for anyone on this site to reproduce your exact issue, since we don't have most of your program and I at least don't have any 40 gig hdf5 files lying around (back in my grad school days, though, terabytes). If none of the suggestions above help, I think you have two ways forward:

    • Try to come up with a simplified repro of your issue, then edit your question to post it here.
    • Using a combination of debugger and profiler (and print statements, if you're feeling lame), try to track down the exact line your program is getting hung up on when single-CPU utilization spins up to 100%. That alone should tell you a whole lot more about what's going on. In particular, it should make it very clear whether anything is getting locked down by a mutex, as @Homer512 suggested in his comments. (A couple of low-effort ways to grab a stack trace are sketched below.)
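
Here's a concrete version of the memory check from the first bullet, as a minimal sketch using h5py and psutil (both assumed to be installed; the file and dataset names are placeholders, not taken from your code):

import h5py
import psutil

FILE_NAME = "data.h5"        # placeholder
DATASET_NAME = "my_dataset"  # placeholder

with h5py.File(FILE_NAME, "r") as f:
    dset = f[DATASET_NAME]
    needed = dset.size * dset.dtype.itemsize      # bytes of raw data you'd hold in RAM
    available = psutil.virtual_memory().available
    print(f"dataset needs ~{needed / 2**30:.1f} GiB, "
          f"~{available / 2**30:.1f} GiB currently free on this node")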
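
On the build-flags point, this is roughly what an OpenMP-enabled setuptools build for a prange extension looks like with GCC or clang (module and source names are placeholders; MSVC uses /openmp instead of -fopenmp):

import numpy as np
from setuptools import Extension, setup
from Cython.Build import cythonize

ext = Extension(
    "read_hdf5",                      # placeholder module name
    sources=["read_hdf5.pyx"],        # placeholder source file
    libraries=["hdf5"],               # link against the HDF5 C library
    include_dirs=[np.get_include()],  # you may also need the HDF5 include/library dirs
    extra_compile_args=["-fopenmp"],  # without this, prange silently runs on one thread
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize([ext]))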
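
Finally, on tracking down where the hang is: py-spy can dump Python-level stacks from an already-running process (py-spy dump --pid <PID>), or you can bake the standard-library faulthandler into the program ahead of time (Unix only). A sketch of the latter, with SIGUSR1 as my arbitrary choice of signal:

import faulthandler
import signal

# Dump the traceback of every thread when the process receives SIGUSR1,
# e.g. run `kill -USR1 <PID>` from another shell while the CPU is pinned at 100%.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# ... the rest of the program, including the call to read_hdf5_dataset(...)

If the Python stacks just show every thread sitting inside the nogil HDF5 calls, attaching gdb (gdb -p <PID>, then thread apply all bt) will show the native frames, including any mutex the threads are blocked on.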