How to parallelize the following code using multiprocessing in Python

I have a function generate(file_path) which returns an integer index and a numpy array. A simplified version of the generate function is as follows:

def generate(file_path):
  temp = np.load(file_path)
  #get index from the string file_path
  idx = int(file_path.split("_")[0])
  #do some mathematical operation on temp
  result = operate(temp)
  return idx, result
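
One detail to check in the index parsing: when file_path carries a directory prefix, splitting the whole string on "_" leaves that prefix in the first piece, and int() raises ValueError. A minimal sketch that parses only the file name, assuming names of the hypothetical form 12_data.npy (index_from_path is an illustrative helper, not part of the original code):

```python
import os


def index_from_path(file_path: str) -> int:
    """Return the leading integer of the file name, ignoring any directory part.

    Assumes names of the hypothetical form '<index>_<anything>', e.g. '12_data.npy'.
    """
    file_name = os.path.basename(file_path)
    return int(file_name.split("_")[0])
```

Inside generate, this would replace the split on the full path; a different naming scheme needs a different parse.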

I need to glob through a directory and collect the results of generate(file_path) into an HDF5 file. My serial code is as follows:

for path in glob.glob(directory):
    idx, result = generate(path)

    hdf5_file["results"][idx,:] = result

I would like to write multi-threaded or multi-process code to speed up the code above. How should I modify it? Many thanks!

My attempt was to modify the generate function and my "main" as follows:

def generate(file_path):
    temp = np.load(file_path)
    #get index from the string file_path
    idx = int(file_path.split("_")[0])
    #do some mathematical operation on temp
    result = operate(temp)
    hdf5_path = "./result.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file["results"][idx,:] = result


if __name__ == '__main__':
    ##construct hdf5 file
    hdf5_path = "./output.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file.create_dataset("results", [2000,15000], np.uint8)


    path_ = "./compute/*"
    p = Pool(mp.cpu_count())
    p.map(generate, glob.glob(path_))

However, it does not work. It throws the following error:

KeyError: "Unable to open object (object 'results' doesn't exist)"


  • You can use a thread or process pool to execute multiple function calls concurrently. Here is an example which uses a process pool:

    from concurrent.futures import ProcessPoolExecutor

    def generate(file_path: str) -> int:
        # Parse the integer index that follows the underscore
        return int(file_path.split("_")[1])

    def main():
        file_paths = ["path_1", "path_2", "path_3"]
        with ProcessPoolExecutor() as pool:
            # map() yields results in the same order as file_paths
            results = pool.map(generate, file_paths)
            for result in results:
                # Write to the HDF5 file here, in the parent process only
                print(result)

    if __name__ == "__main__":
        main()

    Note that you should not write to the same HDF5 file concurrently, i.e. the file writing should not happen in the generate function.
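
To make the single-writer idea concrete, here is a rough sketch in which workers only compute and return (index, row) pairs, while the calling process does all the writing. The names collect and write_hdf5 are hypothetical, generate is a stand-in that fabricates a placeholder row instead of loading a file, and the index is assumed to follow the underscore as in the example above. The dataset name "results" comes from the question, and the shape is left to the caller.

```python
from concurrent.futures import ProcessPoolExecutor


def generate(file_path):
    """Stand-in worker (hypothetical): parse the index and return it with a row.

    Real code would load the array from file_path and process it here,
    but it must not open or write the HDF5 file.
    """
    idx = int(file_path.split("_")[1])
    row = [idx % 256] * 4  # placeholder for the computed result
    return idx, row


def collect(pool, file_paths, dataset):
    """Compute in the pool; write each row from the calling process only."""
    written = 0
    for idx, row in pool.map(generate, file_paths):
        dataset[idx] = row  # works for a list of rows or a 2-D h5py dataset
        written += 1
    return written


def write_hdf5(hdf5_path, file_paths, shape):
    """Create the file and dataset once in the parent, then fill it row by row."""
    import h5py  # third-party; imported here so the rest runs without it

    with h5py.File(hdf5_path, "w") as hdf5_file:
        dataset = hdf5_file.create_dataset("results", shape, dtype="uint8")
        with ProcessPoolExecutor() as pool:
            return collect(pool, file_paths, dataset)
```

When using a process pool, call write_hdf5 from under an if __name__ == "__main__": guard. This layout also sidesteps the KeyError from the question: there, generate opened ./result.hdf5 in 'w' mode, which creates an empty file, so no "results" dataset existed when the assignment ran.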