Tags: python, parallel-processing, python-multiprocessing, hdf5, h5py

Multiprocessing: writing into an HDF5 file


I am running parallelized code in Python and trying to save some values within each iteration. My code can be simplified/summarized as follows:

# Import necessary libraries
import h5py
import multiprocessing as mp

def func(a, b):
    # Generate some data and save it into "vector".

    # Create an HDF5 file and save the data in vector.
    with h5py.File('/some_file.hdf5', 'w') as f:
        f.create_dataset('data_set', data=vector)

# Some code

# Parallelize func
if __name__ == '__main__':
    with mp.Pool(2) as p:
        [p.apply_async(func, args=(elem, b)) for elem in big_array]

I am saving the files while parallelizing to save memory, since I will be working with large amounts of data.

However, every time I run the script, no HDF5 file is generated and the data is not saved.

I am pretty new to parallelization in Python and I do not understand what the problem is.


Solution

  • In the end I replaced the with statement (the last two lines) with the following:

    p = mp.Pool(2)
    result = [p.apply_async(func, args=(elem, b)) for elem in big_array]
    p.close()
    p.join()
    

    and it worked!

    It seems that with the with statement, the main process leaves the block as soon as the tasks have been submitted to the pool, and Pool's context manager then calls terminate(), which kills the workers before all calculations are done. Calling close() and join() explicitly instead waits for every submitted task to finish before the script exits.
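For completeness, below is a minimal runnable sketch of the working pattern. The data generation, the per-task file names, and the example inputs are placeholder assumptions, not part of the original question:

    import h5py
    import numpy as np
    import multiprocessing as mp

    def func(a, b):
        # Placeholder computation standing in for the real data generation.
        vector = np.full(b, a, dtype=np.float64)

        # Give each task its own file (hypothetical naming scheme) so the
        # workers do not overwrite one another's output.
        with h5py.File(f'some_file_{a}.hdf5', 'w') as f:
            f.create_dataset('data_set', data=vector)

    if __name__ == '__main__':
        big_array = range(10)  # stand-in for the real inputs
        b = 5
        p = mp.Pool(2)
        result = [p.apply_async(func, args=(elem, b)) for elem in big_array]
        p.close()  # no more tasks will be submitted
        p.join()   # block until every submitted task has finished

Alternatively, the with statement can be kept as long as the results are collected inside the block, since the pool is only terminated on exit:

    if __name__ == '__main__':
        with mp.Pool(2) as p:
            result = [p.apply_async(func, args=(elem, b)) for elem in big_array]
            for r in result:
                r.get()  # blocks until the task finishes; re-raises worker errors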