python hdf5 h5py

How to merge multiple H5 to one H5 file with Python and h5py?


I am new to Python coding. I want to merge data from 2 H5 files into a main H5 file. My goal is to add all objects in the SRR6XX/SRR630/* groups in each source file (file names in the list h5_files) to the main (target) file (main_h5_path). The code below is my attempt to do this. When I run it, I get this exception:

Error occurred during H5 merging: 'Group' object has no attribute 'encode'

I also tried create_group(), but got the same exception.

What do I need to modify to get my code to work?

    try:
        # read the main file dataset
        with h5py.File(main_h5_path, 'r') as h5_main_file_obj:
            # return if H5 doesn't contain any data
            if len(h5_main_file_obj.keys()) == 0:
                return
            main_file_timestamp_dtset_obj = h5_main_file_obj['/' + 'SRR6XX' + '/' + 'SRR630']

            for file in h5_files:
                with h5py.File(file, 'r') as h5_sub_file_obj:
                    # return if H5 doesn't contain any data
                    if len(h5_sub_file_obj.keys()) == 0:
                        continue
                    sub_file_timestamp_dtset_obj = h5_sub_file_obj['/' + 'SRR6XX' + '/' + 'SRR630']
                    # h5_main_file_obj.create_dataset(sub_file_timestamp_dtset_obj)
                    for ts_key in sub_file_timestamp_dtset_obj.keys():
                        print('ts_key', ts_key)
                        each_ts_ds = h5_sub_file_obj['/' + 'SRR6XX' + '/' + 'SRR630' + '/' + str(ts_key) + '/']
                        h5_main_file_obj.create_dataset(each_ts_ds)


    except (IOError, OSError, Exception) as e:
        print(f"Error occurred during H5 merging: {e}")
        return -1
    return 0

Solution

  • My original answer only copied the group names under group '/SRR6XX/SRR630' in the source files to the main (target) file. OP commented that they want to "copy the group names along with their datasets". I updated my answer to reflect that request. It only requires a one-line change. (For reference, the line to create groups is commented out.)

    Here are the changes to your original code required to get this working:

    1. The main (target) file must be opened in append mode ('a') to add new objects.
    2. ts_key in your loop is the object name (not the object). Use .items() to get names and objects (or just reference the object by name). Note that create_dataset() and create_group() expect a name as their first argument, so passing a Group object (each_ts_ds) is the likely source of the 'encode' exception.
    3. You are creating the new object in the main (target) file at the root level. Reference the appropriate group object (main_file_timestamp_dtset_obj) instead.
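
Points 2 and 3 can be illustrated with a minimal sketch on a throwaway in-memory file. The group names `ts_001` and `ts_002` and the dataset name `values` are made up for illustration and are not from the question's files.

```python
import h5py

# In-memory file: with driver="core" and backing_store=False nothing is written to disk.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    parent = f.create_group("SRR6XX/SRR630")  # intermediate group is created too
    parent.create_group("ts_001").create_dataset("values", data=[1, 2, 3])  # hypothetical names

    # Point 2: .keys() yields names (str); .items() yields (name, object) pairs.
    names = list(parent.keys())
    pairs = [(name, type(obj).__name__) for name, obj in parent.items()]

    # Point 3: calling create_group() on the parent group places the new
    # group under that parent, not at the root of the file.
    parent.create_group("ts_002")
    root_names = sorted(f.keys())
    child_names = sorted(parent.keys())

print(names, pairs, root_names, child_names)
```

Calling `create_group()` on the file object with a bare name would instead add the group next to `SRR6XX` at the root, which is what the original code was doing.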

    Modified code below:

    import h5py

    def your_function():
        # main_h5_path and h5_files are defined elsewhere, as in the question
        with h5py.File(main_h5_path, 'a') as h5_main_file_obj:  # append mode is needed to add objects
            # return if the main H5 file doesn't contain any data
            if len(h5_main_file_obj.keys()) == 0:
                return
            main_file_timestamp_dtset_obj = h5_main_file_obj['/SRR6XX/SRR630']

            for file in h5_files:
                with h5py.File(file, 'r') as h5_sub_file_obj:
                    # skip source files that don't contain any data
                    if len(h5_sub_file_obj.keys()) == 0:
                        continue
                    sub_file_timestamp_dtset_obj = h5_sub_file_obj['/SRR6XX/SRR630']
                    for ts_key in sub_file_timestamp_dtset_obj.keys():
                        print('ts_key:', ts_key)
                        # This only creates the group:
                        # main_file_timestamp_dtset_obj.create_group(ts_key)
                        # This copies the group and its objects (groups or datasets):
                        grp_path = 'SRR6XX/SRR630/' + ts_key
                        h5_sub_file_obj.copy(h5_sub_file_obj[grp_path], main_file_timestamp_dtset_obj)
    

    I wrote another solution that is more compact and checks if source objects are Groups before copying. See below. Another check to consider: conflicts with existing group names in the main (target) file before copying each group. As noted in my comment, consider using External Links to avoid duplicate data.

    def my_function():
          
        with h5py.File(main_h5_path, mode='a') as h5ft:
            if len(h5ft.keys()) == 0:
                return
            for h5_source in h5_files:
                with h5py.File(h5_source,'r') as h5fs:
                    if len(h5fs.keys()) == 0:  # skip empty source files
                        continue
                    for grp_name, h5_obj in h5fs['SRR6XX/SRR630'].items(): 
                        if isinstance(h5_obj,h5py.Group):
                            # This only creates group:
                            #h5ft['SRR6XX/SRR630'].create_group(grp_name) 
                            # This copies Group and its objects (groups or datasets):
                            grp_path = 'SRR6XX/SRR630/' + grp_name
                            h5fs.copy(h5fs[grp_path], h5ft['SRR6XX/SRR630'])
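
The two suggestions mentioned earlier (checking for name conflicts before copying, and using External Links instead of copying) could look roughly like the sketch below. The function names `copy_new_groups` and `link_groups` are hypothetical, and skipping a conflicting name is only one possible policy. Unlike the code above, this sketch uses `require_group()`, so it creates the parent group in the target if it is missing instead of returning early.

```python
import h5py

GROUP_PATH = "SRR6XX/SRR630"  # parent group used in the question


def copy_new_groups(target_path, source_paths, group_path=GROUP_PATH):
    """Copy child groups from each source file, skipping names already in the target."""
    skipped = []
    with h5py.File(target_path, "a") as h5ft:
        target_grp = h5ft.require_group(group_path)
        for source_path in source_paths:
            with h5py.File(source_path, "r") as h5fs:
                if group_path not in h5fs:
                    continue  # this source has nothing to merge
                for name, obj in h5fs[group_path].items():
                    if not isinstance(obj, h5py.Group):
                        continue
                    if name in target_grp:
                        # Name conflict: leave the target untouched and report it.
                        skipped.append((source_path, name))
                        continue
                    h5fs.copy(obj, target_grp)
    return skipped


def link_groups(target_path, source_paths, group_path=GROUP_PATH):
    """Add External Links in the target that point at groups kept in the source files."""
    with h5py.File(target_path, "a") as h5ft:
        target_grp = h5ft.require_group(group_path)
        for source_path in source_paths:
            with h5py.File(source_path, "r") as h5fs:
                if group_path not in h5fs:
                    continue
                names = [n for n, o in h5fs[group_path].items()
                         if isinstance(o, h5py.Group)]
            for name in names:
                if name not in target_grp:
                    # No data is duplicated; the link stores only a file name and a path.
                    target_grp[name] = h5py.ExternalLink(source_path, f"/{group_path}/{name}")
```

With External Links the source files must stay where the links expect them, since the main file stores only references to the data. `copy_new_groups()` returns the list of `(source_path, name)` pairs it skipped, so the caller can decide how to handle the conflicts.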